diff --git a/.claude/skills/metadata-exposure-enrichment/SKILL.md b/.claude/skills/metadata-exposure-enrichment/SKILL.md
new file mode 100644
index 0000000..2fc43c0
--- /dev/null
+++ b/.claude/skills/metadata-exposure-enrichment/SKILL.md
@@ -0,0 +1,104 @@
+---
+name: metadata-exposure-enrichment
+description: Enrich dbt exposure definitions by querying Metabase directly via MCP. Discovers dashboard cards, maps table and column references to dbt models, audits the existing _exposures.yml for gaps, and writes back a fully enriched exposure. Triggers include "enrich exposure", "exposure enrichment", "document dashboard", "update exposures", "what does the dashboard use".
+---
+
+# Exposure Enrichment
+
+Discovers what a Metabase dashboard actually contains and writes that context back into dbt exposures. Works with the Metabase and Postgres MCPs only; no metadata platform required.
+
+## How It Works
+
+1. **Parse `$ARGUMENTS`** -- Dashboard name, dashboard ID (e.g. `2`), or `all`. If empty, default to `all`. Extract the optional `--dry-run` flag (report only, no file write).
+2. **Check file exists** -- Read `dbt/models/marts/_exposures.yml`. If missing, stop and output:
+   > ERROR: `dbt/models/marts/_exposures.yml` not found. Create a barebones version first with `name`, `type`, `url`, and `depends_on` fields, then re-run this skill.
+3. **Discover via Metabase MCP** -- Execute sequentially:
+   a. `metabase-list-dashboards` -- confirm the target dashboard exists and get its ID
+   b. `metabase-get-dashboard` with the dashboard ID -- extract the `dashcards` array to get all `card_id` values
+   c. For each `card_id`: call `metabase-get-question` -- collect card name, display type, and `dataset_query` (MBQL `source-table` or native SQL)
+   d. `metabase-get-database-metadata` for the database -- map internal Metabase table IDs to real table names in the `marketing` schema
+   e. `metabase-get-current-user` -- capture email for the exposure `owner` field
+4. **Cross-reference to dbt** -- For each card's source table, determine whether it maps to a dbt mart model or a raw source:
+   - Read `dbt/models/marts/` SQL files and `_marts.yml` to confirm mart models
+   - Read `dbt/models/staging/_sources.yml` to identify raw source tables
+   - Mart models become `ref('model_name')` in `depends_on`
+   - Raw source tables become `source('marketing_raw', 'table_name')` in `depends_on`
+   - Flag any table that doesn't map to either
+5. **Audit the existing exposure** -- Read `_exposures.yml` and check each exposure:
+   - Is `description` present and non-empty?
+   - Is `owner` present with name and email?
+   - Is `maturity` set?
+   - Does `depends_on` include ALL models and sources discovered in step 3 and mapped in step 4?
+   - Are card-level details documented in the description?
+   - Are key columns documented?
+6. **Report** -- Print a structured summary before offering any write:
+   - Dashboard: name, URL, total card count
+   - Card inventory: ID, name, display type, source table, columns used (aggregation + breakout)
+   - Audit gaps: what `_exposures.yml` is missing vs what was discovered
+   - End with: `GAPS: N fields missing | Cards: N discovered | Models: N mapped`
+7. **Offer to enrich** -- Propose the enriched YAML and confirm with the user before writing. If `--dry-run`, print the proposed content only. On confirmation:
+   - Write to `dbt/models/marts/_exposures.yml`
+   - Report what changed (before vs after)
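+
+For orientation, the MCP calls in step 3 correspond to Metabase's public REST API. A minimal sketch of the same discovery with `curl` and `jq`, assuming the API key is exported as `METABASE_API_KEY` (dashboard 2 and card 40 are this repo's demo values; database ID `1` is an assumption -- list `/api/database` first to confirm):
+
+```bash
+# Cards placed on dashboard 2 -- the metabase-get-dashboard tool surfaces
+# the same dashcards array.
+curl -s -H "x-api-key: $METABASE_API_KEY" \
+  http://localhost:3000/api/dashboard/2 | jq '[.dashcards[].card_id]'
+
+# One card's definition: name, display type, and dataset_query
+# (MBQL source-table ID or native SQL).
+curl -s -H "x-api-key: $METABASE_API_KEY" \
+  http://localhost:3000/api/card/40 | jq '{name, display, dataset_query}'
+
+# Resolve Metabase's internal table IDs to schema-qualified table names.
+curl -s -H "x-api-key: $METABASE_API_KEY" \
+  http://localhost:3000/api/database/1/metadata |
+  jq '.tables[] | {id, schema, name}'
+```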
+
+## Description Format
+
+Plain text with bracketed headers. Same convention as `_marts.yml` descriptions. dbt YAML and OpenMetadata both render plain text.
+
+**Exposure description**:
+`[Business Purpose]` what decisions the dashboard drives and who uses it.
+`[Cards]` list each card: name, chart type, what it measures.
+`[Key Columns]` columns surfaced in the dashboard (aggregation columns + breakout dimensions).
+`[Data Sources]` which dbt models and sources feed the dashboard, and how.
+`[Known Issues / Caveats]` date range defaults, missing channels, filter behavior, archived cards.
+
+## Reference
+
+**Target dashboard**: "Agentic Data Modeling Demo" (ID=2), URL `http://localhost:3000/dashboard/2`
+
+**Known card inventory on dashboard 2**:
+- Card 40: ROAS (smartscalar) -- avg(roas) from campaign_performance, grouped by date
+- Card 41: CR% (smartscalar) -- avg(conversion_rate) from campaign_performance, grouped by date
+- Card 42: Target Revenue (progress) -- sum(total_revenue) from campaign_performance, grouped by date
+- Card 43: Daily Spend by Channel (bar) -- sum(spend) from campaigns_daily, grouped by date + channel
+- Card 44: Desktop Per Channel (pie) -- sum(desktop_sessions) from campaign_performance, grouped by channel
+- Card 45: Mobile Per Channel (pie) -- sum(mobile_sessions) from campaign_performance, grouped by channel
+
+**Standalone card NOT on dashboard 2** (do not include):
+- Card 38: ROAS (table, native SQL) -- archived, not on any active dashboard
+
+**Metabase table to dbt model map** (a warehouse-side check is sketched below):
+- `campaign_performance` -> mart model, use `ref('campaign_performance')`
+- `campaigns_daily` -> raw source table staged as `stg_campaigns_daily`, use `source('marketing_raw', 'campaigns_daily')`
+- `daily_summary` -> mart model, not directly queried by any card but is a rollup of campaign_performance
+
+**dbt mart models**: `campaign_performance`, `daily_summary`, `user_journey`, `channel_attribution`
+**Files**: `dbt/models/marts/_exposures.yml`, `dbt/models/marts/_marts.yml`
+**Sources**: `dbt/models/staging/_sources.yml` (source name: `marketing_raw`, schema: `marketing`)
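+
+To sanity-check this map against the warehouse, the lookup the skill performs through the Postgres MCP's `execute_sql` tool can be sketched with `psql` (the connection string below is an assumption -- adjust to your local setup):
+
+```bash
+# Confirm the mapped tables actually exist in the marketing schema
+psql "postgresql://localhost:5432/postgres" -c \
+  "SELECT table_name FROM information_schema.tables
+   WHERE table_schema = 'marketing' ORDER BY table_name;"
+```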
+
+## Output Format
+
+```
+## Exposure Enrichment: {dashboard_name} (ID={id})
+
+### Dashboard Discovery
+- Cards on dashboard: {count}
+- Source tables: {table} ({n} cards), ...
+- dbt mapping: {table} -> {ref or source}
+
+### Card Inventory
+| Card | Name | Type | Source Table | Columns Used |
+|------|------|------|--------------|--------------|
+| ... | ... | ... | ... | ... |
+
+### Audit: _exposures.yml gaps
+- description: {PRESENT | MISSING}
+- owner: {PRESENT | MISSING}
+- maturity: {PRESENT | MISSING}
+- depends_on: {complete | missing: list}
+- card documentation: {PRESENT | MISSING}
+
+### Proposed enrichment
+{full enriched YAML}
+
+GAPS: {n} fields missing | Cards: {n} discovered | Models: {n} mapped
+```
\ No newline at end of file
diff --git a/.mcp.json b/.mcp.json
index cedc2cf..15ce17e 100644
--- a/.mcp.json
+++ b/.mcp.json
@@ -23,6 +23,14 @@
       "env": {
         "AUTH_HEADER": "Bearer "
       }
+    },
+    "metabase": {
+      "command": "npx",
+      "args": ["-y", "@getnao/metabase-mcp-server@latest"],
+      "env": {
+        "METABASE_URL": "http://localhost:3000",
+        "METABASE_API_KEY": ""
+      }
     }
   }
 }
diff --git a/QUICKSTART.md b/QUICKSTART.md
index 1651063..eb33f41 100644
--- a/QUICKSTART.md
+++ b/QUICKSTART.md
@@ -11,6 +11,7 @@ A step by step guide on how to get started with this project.
 OPENMETADATA_JWT_TOKEN=your_jwt_token_here
 METABASE_USERNAME=your_metabase_username
 METABASE_PASSWORD=your_metabase_password
+METABASE_API_KEY=your_metabase_api_key_here
 ```
 
 2. Run the docker container:
@@ -91,6 +92,23 @@ Replace `` in `.mcp.json` with the token you genera
 
 > **Docs**: [OpenMetadata MCP Reference](https://docs.open-metadata.org/v1.10.x/how-to-guides/mcp/reference)
 
+### Metabase MCP
+
+The `/metadata-exposure-enrichment` skill queries Metabase dashboards directly via the [nao-metabase-mcp-server](https://github.com/getnao/nao-mcp-servers). It is already configured in `.mcp.json`; you only need to supply an API key.
+
+**Generate an API key:**
+1. Go to `http://localhost:3000/admin/settings/authentication/api-keys`
+2. Create a new API key
+3. Add it to your `.env`:
+
+```bash
+METABASE_API_KEY=your_api_key_here
+```
+
+Replace `` in `.mcp.json` with your key.
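+
+To confirm the key works before starting Claude, hit the Metabase API directly (a quick sanity check; assumes the key is exported in your shell):
+
+```bash
+# Should return your Metabase user as JSON, not an authentication error
+curl -s -H "x-api-key: $METABASE_API_KEY" \
+  http://localhost:3000/api/user/current
+```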
+
+> **Docs**: [nao-metabase-mcp-server](https://github.com/getnao/nao-mcp-servers)
+
 ### Verify `.mcp.json`
 
 Your `.mcp.json` should look like this (already included in the repo):
@@ -121,6 +139,14 @@ Your `.mcp.json` should look like this (already included in the repo):
       "env": {
         "AUTH_HEADER": "Bearer "
       }
+    },
+    "metabase": {
+      "command": "npx",
+      "args": ["-y", "@getnao/metabase-mcp-server@latest"],
+      "env": {
+        "METABASE_URL": "http://localhost:3000",
+        "METABASE_API_KEY": ""
+      }
     }
   }
 }
@@ -136,6 +162,7 @@ Then use the MCP servers to ask questions such as:
 
 - "Who owns the Agentic Data Modeling Demo dashboard?"
 - "Is `user_journey` ready to be consumed by an AI agent?"
 - "Create a business glossary from our dbt models"
+- "Enrich the exposure for the Agentic Data Modeling Demo dashboard"
 
 ---
diff --git a/README.md b/README.md
index 732a202..8320279 100644
--- a/README.md
+++ b/README.md
@@ -25,12 +25,13 @@ This project connects Claude to the data stack through **two MCP servers**, givi
 |---|---|---|
 | **OpenMetadata MCP** | Metadata catalog — lineage, search, glossaries, entity details | `search_metadata`, `get_entity_lineage`, `get_entity_details`, `create_glossary_term` |
 | **PostgreSQL MCP** | Direct database access — query data, profile columns, validate models | `execute_sql`, `list_tables`, `list_table_stats` |
+| **Metabase MCP** | Direct dashboard access — discover cards, questions, database metadata | `metabase-list-dashboards`, `metabase-get-dashboard`, `metabase-get-question` |
 
 The **PostgreSQL MCP** uses [Google GenAI Toolbox](https://github.com/googleapis/genai-toolbox) (pre-downloaded binary in `bin/toolbox`) to give Claude direct SQL access to the local PostgreSQL instance. This enables data profiling, edge case discovery, and validation queries — capabilities used heavily by the AI Readiness skill.
 
 The **OpenMetadata MCP** connects to the OpenMetadata server's native MCP endpoint, providing metadata search, lineage tracing, and glossary management through natural language.
 
-Both servers are configured in `.mcp.json` at the project root, with permissions managed in `.claude/settings.local.json`.
+All three servers are configured in `.mcp.json` at the project root, with permissions managed in `.claude/settings.local.json`.
 
 ### What this enables
 
@@ -43,7 +44,7 @@
 
 ## 🛠️ Claude Code Skills
 
-The project includes three custom **Claude Code skills** (in `.claude/skills/`) that encode repeatable data engineering workflows as slash commands. These skills combine the OpenMetadata and PostgreSQL MCP tools with local file analysis to automate common tasks:
+The project includes four custom **Claude Code skills** (in `.claude/skills/`) that encode repeatable data engineering workflows as slash commands. These skills combine the OpenMetadata, PostgreSQL, and Metabase MCP tools with local file analysis to automate common tasks:
 
 ### `/metadata-impact-analysis`
 Analyze downstream impact before making schema changes. Traces lineage through dbt models and dashboards to identify what breaks if a column is renamed, dropped, or its type changes.
@@ -54,6 +55,9 @@ Audit and enrich dbt mart models for AI consumption. Checks schema quality, quer
 ### `/metadata-glossary`
 Manage an OpenMetadata glossary derived from dbt models. Parses dbt YAML for column names and descriptions, groups them into business categories, and creates/syncs glossary terms via OpenMetadata.
 
+### `/metadata-exposure-enrichment`
+Enrich dbt exposure definitions by querying Metabase directly via MCP. Discovers dashboard cards, maps table and column references to dbt models, audits the existing `_exposures.yml` for gaps, and writes back a fully enriched exposure.
+
 ## 📚 Documentation
 
 This project includes comprehensive documentation to help you get started:
@@ -90,8 +94,8 @@ This setup enables a complete data analytics workflow where:
 2. dbt transforms and models the data locally
 3. Metabase provides interactive dashboards
 4. OpenMetadata centralizes metadata from all components via **YAML-based ingestion** (not UI), providing unified lineage and metadata views
-5. Claude connects via two MCP servers (OpenMetadata + PostgreSQL) for metadata exploration and direct data access
-6. Custom skills (`/metadata-impact-analysis`, `/metadata-ai-readiness`, `/metadata-glossary`) automate repeatable data engineering workflows
+5. Claude connects via three MCP servers (OpenMetadata + PostgreSQL + Metabase) for metadata exploration, direct data access, and dashboard discovery
+6. Custom skills (`/metadata-impact-analysis`, `/metadata-ai-readiness`, `/metadata-glossary`, `/metadata-exposure-enrichment`) automate repeatable data engineering workflows
 
 **Key Feature:** All OpenMetadata ingestion is configured through YAML files, enabling Infrastructure as Code (IaC) practices. Ingestion runs on-demand using Docker Compose profiles, giving you control over when metadata is synchronized. While OpenMetadata provides a UI for configuration, this project uses YAML files for version control, automation, and reproducibility.
 
@@ -101,8 +105,9 @@
 │   └── skills/              # Custom Claude Code skills
 │       ├── metadata-impact-analysis/
 │       ├── metadata-ai-readiness/
-│       └── metadata-glossary/
-├── .mcp.json                # MCP server definitions (Postgres + OpenMetadata)
+│       ├── metadata-glossary/
+│       └── metadata-exposure-enrichment/
+├── .mcp.json                # MCP server definitions (Postgres + OpenMetadata + Metabase)
 ├── bin/
 │   └── toolbox              # Google GenAI Toolbox binary (Postgres MCP)
 ├── dbt/                     # dbt project
diff --git a/dbt/models/marts/_exposures.yml b/dbt/models/marts/_exposures.yml
new file mode 100644
index 0000000..6632224
--- /dev/null
+++ b/dbt/models/marts/_exposures.yml
@@ -0,0 +1,37 @@
+version: 2
+
+exposures:
+  - name: agentic_data_modeling_demo
+    type: dashboard
+    maturity: low
+    url: http://localhost:3000/dashboard/2
+    description: >
+      [Business Purpose] Marketing performance dashboard used to monitor campaign ROI,
+      conversion efficiency, revenue targets, and channel-level spend and device breakdown.
+      Supports daily decision-making on budget allocation and channel optimization.
+
+      [Cards] 6 cards:
+      (1) ROAS -- smartscalar showing average return on ad spend over time.
+      (2) CR% -- smartscalar showing average conversion rate over time.
+      (3) Target Revenue -- progress bar tracking cumulative revenue against a 100k goal.
+      (4) Daily Spend by Channel -- stacked bar chart of daily spend broken down by marketing channel.
+      (5) Desktop Per Channel -- pie chart of total desktop sessions by channel.
+      (6) Mobile Per Channel -- pie chart of total mobile sessions by channel.
+
+      [Key Columns] roas, conversion_rate, total_revenue, spend, desktop_sessions,
+      mobile_sessions, date, channel.
+
+      [Data Sources] campaign_performance mart (cards 1-3, 5-6) and campaigns_daily
+      raw source (card 4). daily_summary is an indirect dependency as a rollup of
+      campaign_performance.
+
+      [Known Issues / Caveats] Dashboard date filter defaults to past 7 days.
+      Card 43 (Daily Spend) queries the raw campaigns_daily source table directly
+      rather than a mart model.
+    owner:
+      name: Alejandro Aboy
+      email: aboyalejandro@gmail.com
+    depends_on:
+      - ref('campaign_performance')
+      - ref('daily_summary')
+      - source('marketing_raw', 'campaigns_daily')