diff --git a/.claude/skills/metadata-ai-readiness/SKILL.md b/.claude/skills/metadata-ai-readiness/SKILL.md new file mode 100644 index 0000000..d5a4039 --- /dev/null +++ b/.claude/skills/metadata-ai-readiness/SKILL.md @@ -0,0 +1,65 @@ +--- +name: metadata-ai-readiness +description: Audit and enrich dbt mart models for AI consumption. Applies dbt Agent Skills' writing-documentation standards as audit criteria and automates the discovering-data methodology via Postgres MCP. Writes enriched descriptions back to dbt YAML. Triggers include "ai readiness", "is this model ready", "enrich", "audit yaml", "pre-merge check". +--- + +# AI Readiness + +Automates the standards defined in dbt Agent Skills' `writing-documentation` and `discovering-data` references. Applies them as an audit checklist and runs the data profiling via Postgres MCP so you don't have to do it table by table. + +## How It Works + +1. **Parse `$ARGUMENTS`** — Model name (e.g. `campaign_performance`), or `all`/empty for all mart models. 2. **dbt schema audit** — Apply `writing-documentation` as a checklist. Read `dbt/models/marts/{model}.sql` + `dbt/models/marts/_marts.yml`. Check: - Model has a description - All SQL columns are present in YAML - Descriptions say something beyond the column name (flag any that merely restate it) - Grain columns have `not_null` tests plus a uniqueness check: `unique` for a single-column grain; for a composite grain, a combination test such as `dbt_utils.unique_combination_of_columns` (a per-column `unique` test would wrongly fail on composite grains) 3. 
**Query the database** — Automate the `discovering-data` 6-step methodology via Postgres MCP against `localhost:5432`: - **Grain validation**: `SELECT COUNT(*), COUNT(DISTINCT {grain_columns}) FROM {model}` — confirm the declared grain holds (for composite grains, use a row constructor in Postgres: `COUNT(DISTINCT (campaign_id, date))`) - **Column profiling**: NULL %, min/max, and distinct counts on every metric column - **Edge case discovery**: zeros vs NULLs (COALESCE'd columns behave differently from NULLIF'd columns), skewed distributions, date gaps - **Example queries**: 2-3 queries demonstrating how to use the model for common business questions - All findings become candidates for `[Known Issues / Caveats]` entries 4. **Report** — Pass/fail checklist per model with two sections: **dbt Schema** + **Query Guidance**. End with `PASS: X/Y | Auto-fixable: N | Manual: N`. 5. **Offer fixes** — Propose changes, confirm with the user, edit `_marts.yml`: - **Can fix**: missing/thin descriptions (model + columns), missing column YAML entries - **Cannot fix** (flag only): missing dbt tests (print a snippet for the user to add) + +## Description Format + +Plain text with bracketed headers (no markdown; dbt YAML renders descriptions as plain text). + +**Tables**: +`[Business Purpose]` what business questions it answers and why it exists. +`[How It's Used]` who consumes it and what decisions it drives. +`[Data Grain]` one row = what. Source lineage (which staging/intermediate models feed it). +`[Known Issues / Caveats]` exclusions, NULLs, COALESCEs, edge cases found in profiling. + +**Columns**: +`[Business Purpose]` what the value represents. Never restate the column name. +`[Known Issues / Caveats]` only when real caveats exist; skip if none found. 
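The step-2 grain validation amounts to comparing a row count against a distinct count of the grain columns. A minimal sketch of that check, using an in-memory SQLite database with toy rows as a stand-in for the real Postgres MCP connection (table and column names follow the grain map; the data is invented for illustration):

```python
import sqlite3

# Toy stand-in for the campaign_performance mart.
# Grain map: composite grain = campaign_id + date.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE campaign_performance (campaign_id TEXT, date TEXT, spend REAL)"
)
conn.executemany(
    "INSERT INTO campaign_performance VALUES (?, ?, ?)",
    [
        ("c1", "2025-01-01", 120.50),
        ("c1", "2025-01-02", 98.75),
        ("c2", "2025-01-01", 210.00),
    ],
)

# SQLite has no COUNT(DISTINCT (a, b)) row constructor, so concatenate the
# grain columns with a separator. If the two counts differ, the declared
# grain does not hold.
row_count, grain_count = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT campaign_id || '|' || date) "
    "FROM campaign_performance"
).fetchone()

assert row_count == grain_count, f"grain violated: {row_count} rows, {grain_count} keys"
print(f"grain holds: {row_count} rows at campaign_id + date")
```

Against the actual warehouse the same two-count comparison runs via the Postgres MCP, where `COUNT(DISTINCT (campaign_id, date))` works directly without the concatenation workaround.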
+ +## Reference + +**Mart models**: `campaign_performance`, `daily_summary` +**Files**: SQL at `dbt/models/marts/{model}.sql`, YAML at `dbt/models/marts/_marts.yml` +**Upstream**: trace `{{ ref('...') }}` calls in SQL to find source models + +**Grain map**: +- `campaign_performance` → composite: `campaign_id` + `date` +- `daily_summary` → single: `date` + +**dbt Agent Skills standards this skill automates**: +- `writing-documentation` — "Never generate documentation which simply restates the entity's name. Describe why, not just what." +- `discovering-data` — 6-step methodology: inventory, sample, grain check, profile, validate relationships, document findings. + +## Output Format + +Checklist per model with 2 sections: + +**dbt Schema** (description exists, all SQL columns in YAML, descriptions pass writing-documentation check, grain columns have tests) + +**Query Guidance** (grain holds, column profiles, edge cases found, example queries) + +End with: `PASS: X/Y | Auto-fixable: N | Manual: N` \ No newline at end of file diff --git a/dbt/models/marts/_marts.yml b/dbt/models/marts/_marts.yml index 6b8ad95..1b45459 100644 --- a/dbt/models/marts/_marts.yml +++ b/dbt/models/marts/_marts.yml @@ -3,154 +3,193 @@ version: 2 models: - name: campaign_performance description: | - Complete campaign performance fact table combining spend, impressions, clicks, - sessions, and conversions. This is the primary table for analyzing campaign ROI - and performance metrics. + [Business Purpose] Answers how each campaign performs day-over-day across spend efficiency, audience engagement, and revenue attribution. Primary table for diagnosing which campaigns justify continued investment and which need reallocation. - Grain: One row per campaign per date + [How It's Used] Marketing analysts use it for daily performance reviews and budget reallocation decisions. BI dashboards pull ROAS, CPA, and conversion rate trends from this table. 
AI agents use it to surface underperforming campaigns and recommend optimizations. - Key metrics: - - Advertising metrics (spend, impressions, clicks, CTR, CPC) - - Session metrics (sessions, users, engagement, device breakdown) - - Conversion metrics (conversions, revenue, AOV) - - Calculated KPIs (conversion rate, ROAS, CPA, click-to-session rate) + [Data Grain] One row per campaign per date. Joins stg_campaigns_daily (spine) with stg_sessions and stg_conversions via LEFT JOIN on campaign_id + date. + + [Known Issues / Caveats] Session and conversion columns are COALESCE'd to 0 on LEFT JOIN misses — a zero value means "no matching session/conversion data", not "measured zero". avg_order_value reads 0 when there are no conversions (41 of 400 rows), which is misleading — filter to total_conversions > 0 before averaging. total_sessions is uniformly 110 across all campaign-dates in the current dataset, suggesting synthetic or incomplete session source data. Date 2025-12-20 is missing from the source and propagates as a gap here. Calculated KPIs (conversion_rate, roas, cost_per_conversion, click_to_session_rate) fall back to 0 when their denominator is zero rather than returning NULL. columns: - name: campaign_id - description: Unique identifier for the campaign + description: "[Business Purpose] Identifies which campaign a row belongs to. Join key for linking to campaign metadata or other campaign-scoped tables. Part of the composite grain with date." data_tests: - not_null - name: date - description: Date of the performance metrics + description: "[Business Purpose] Calendar date the metrics were recorded. Part of the composite grain with campaign_id. Use for time-series analysis and trend detection." data_tests: - not_null - name: campaign_name - description: Name of the campaign + description: "[Business Purpose] Human-readable label assigned to the campaign at creation. 
Use for display in reports and dashboards — not stable as a join key since names can be edited." - name: channel - description: Marketing channel (Meta, Google Ads, LinkedIn, etc.) + description: "[Business Purpose] Marketing platform where the campaign runs (google_ads, meta, linkedin, tiktok, twitter, pinterest, reddit, snapchat). Use for channel-mix analysis and cross-platform benchmarking." - name: status - description: Campaign status (active, paused, etc.) + description: "[Business Purpose] Operational state of the campaign (active, paused). Paused campaigns still have historical rows — filter to status = 'active' for live performance views." # Spend metrics - name: daily_budget - description: Daily budget allocated for the campaign in dollars + description: "[Business Purpose] Maximum amount the campaign is configured to spend per day in dollars. Compare against actual spend to assess pacing and budget headroom." - name: spend - description: Actual amount spent on the campaign in dollars + description: "[Business Purpose] Actual dollars spent on the campaign for this date. Primary cost input for efficiency KPIs (ROAS, CPA, CPC). Always > 0 in current data — no zero-spend days observed." # Impression & Click metrics - name: impressions - description: Number of ad impressions served + description: "[Business Purpose] Number of times ads were shown to users. Top-of-funnel volume metric — divide clicks by impressions to get CTR." - name: clicks - description: Number of clicks on ads + description: "[Business Purpose] Number of ad clicks recorded by the ad platform. Measures intent signal from impressions. Compare against total_sessions to detect click-to-session drop-off." - name: ctr - description: Click-through rate (clicks / impressions) + description: "[Business Purpose] Click-through rate: clicks divided by impressions. Measures ad creative effectiveness. Sourced directly from stg_campaigns_daily, not recalculated here." 
- name: cpc - description: Cost per click in dollars + description: "[Business Purpose] Cost per click: spend divided by clicks. Measures auction efficiency for the campaign. Sourced from stg_campaigns_daily." # Session metrics - name: total_sessions - description: Total number of website sessions from this campaign + description: | + [Business Purpose] Count of distinct website sessions attributed to this campaign on this date. Measures how effectively ad clicks convert into site visits. + [Known Issues / Caveats] COALESCE'd to 0 when no session data matches — zero means "no data", not "no sessions". Currently reads 110 for every campaign-date in the dataset, which is suspiciously uniform and likely reflects synthetic source data. - name: unique_users - description: Number of unique users who had sessions + description: | + [Business Purpose] Count of distinct users who had at least one session from this campaign on this date. Lower than total_sessions when users visit multiple times. + [Known Issues / Caveats] COALESCE'd to 0 on LEFT JOIN miss. - name: avg_session_duration - description: Average session duration in seconds + description: | + [Business Purpose] Mean session length in seconds for sessions attributed to this campaign-date. Proxy for content engagement quality. + [Known Issues / Caveats] COALESCE'd to 0 when no sessions match — a 0 here is not a real measurement. - name: avg_pages_per_session - description: Average number of pages viewed per session + description: | + [Business Purpose] Mean number of pages viewed per session. Indicates depth of user engagement with the site after clicking through. + [Known Issues / Caveats] COALESCE'd to 0 when no sessions match. - name: engaged_sessions - description: Number of sessions classified as engaged + description: | + [Business Purpose] Count of sessions classified as "engaged" by the engagement_level field in stg_sessions. Useful for filtering out bounce-like visits when calculating quality metrics. 
+ [Known Issues / Caveats] COALESCE'd to 0 on LEFT JOIN miss. - name: mobile_sessions - description: Number of sessions from mobile devices + description: | + [Business Purpose] Sessions from mobile devices. Use alongside desktop_sessions for device-mix analysis and to inform creative strategy (mobile-optimized vs desktop). + [Known Issues / Caveats] COALESCE'd to 0 on LEFT JOIN miss. - name: desktop_sessions - description: Number of sessions from desktop devices + description: | + [Business Purpose] Sessions from desktop devices. Complement to mobile_sessions for device segmentation. + [Known Issues / Caveats] COALESCE'd to 0 on LEFT JOIN miss. # Conversion metrics - name: total_conversions - description: Total number of conversions attributed to this campaign + description: | + [Business Purpose] Count of distinct conversions attributed to this campaign-date. Bottom-of-funnel outcome metric used to calculate conversion_rate, ROAS, and CPA. + [Known Issues / Caveats] COALESCE'd to 0 when no conversions match (41 of 400 rows). Zero means "no attributed conversions", not a measurement error. - name: converting_users - description: Number of unique users who converted + description: | + [Business Purpose] Count of distinct users who converted. Lower than total_conversions when a single user converts multiple times. + [Known Issues / Caveats] COALESCE'd to 0 on LEFT JOIN miss. - name: total_revenue - description: Total revenue from conversions in dollars + description: | + [Business Purpose] Sum of conversion values in dollars attributed to this campaign-date. Primary revenue input for ROAS calculation. + [Known Issues / Caveats] COALESCE'd to 0 when no conversions exist. Zero revenue is always paired with zero conversions. - name: avg_order_value - description: Average order value (revenue per conversion) + description: | + [Business Purpose] Mean revenue per conversion (total_revenue / total_conversions). 
Indicates the value profile of customers acquired through this campaign. + [Known Issues / Caveats] COALESCE'd to 0 when there are no conversions — this is misleading. Filter to total_conversions > 0 before using this column in averages or comparisons. # Calculated KPIs - name: conversion_rate - description: Conversion rate (conversions / sessions) + description: | + [Business Purpose] Conversions divided by sessions. Measures how effectively site traffic converts to revenue events. Core efficiency KPI for campaign optimization. + [Known Issues / Caveats] Returns 0 when total_sessions is 0 (denominator guard), not NULL. Downstream consumers should treat 0 with caution — it may mean "no data" rather than "zero conversions from real traffic". - name: roas - description: Return on ad spend (revenue / spend) + description: | + [Business Purpose] Return on ad spend: total_revenue divided by spend. Values above 1.0 mean the campaign generates more revenue than it costs. Primary profitability signal. + [Known Issues / Caveats] Returns 0 when spend is 0 (denominator guard). Current range: 0.07 to 6.20 — wide spread indicates significant performance variation across campaigns. - name: cost_per_conversion - description: Cost per acquisition - spend divided by conversions + description: | + [Business Purpose] Spend divided by total conversions. Measures acquisition cost per conversion event. Lower is better — compare against avg_order_value to assess unit economics. + [Known Issues / Caveats] Returns 0 when total_conversions is 0 (denominator guard). A 0 here means "no conversions to divide by", not "free acquisitions". - name: click_to_session_rate - description: Rate of clicks that resulted in sessions + description: | + [Business Purpose] Sessions divided by clicks. Measures what fraction of ad clicks result in tracked site sessions. Values below 1.0 indicate attribution or tracking gaps between the ad platform and site analytics. 
+ [Known Issues / Caveats] Returns 0 when clicks is 0 (denominator guard). - name: daily_summary description: | - Daily rollup summary fact table aggregating all campaign performance - across the entire business. Provides a high-level view of marketing - performance over time. - - Grain: One row per date - - Key metrics: - - Campaign and channel activity levels - - Total spend and budget utilization - - Aggregate advertising performance - - Overall session and user engagement - - Total conversions and revenue - - Portfolio-level KPIs (conversion rate, ROAS, CPA) + [Business Purpose] Answers how the overall marketing portfolio performs day-over-day. Enables executives and analysts to spot macro trends in spend efficiency, audience reach, and revenue without drilling into individual campaigns. + + [How It's Used] Executive dashboards for daily marketing health. Week-over-week and month-over-month trend analysis. Anomaly detection for total spend or conversion drops. AI agents use it as a starting point before drilling into campaign_performance for root cause. + + [Data Grain] One row per date. Aggregates all rows from campaign_performance for that date. Inherits its data from the campaign_performance mart, not directly from staging models. + + [Known Issues / Caveats] total_sessions is uniformly 2,200 every day (20 campaigns x 110) and total_conversions is uniformly 150 every day — both reflect the synthetic uniformity in the underlying session and conversion source data. Date 2025-12-20 is missing (gap inherited from source). budget_utilization, overall_conversion_rate, overall_roas, and overall_cpa use NULLIF for division safety — they will return NULL on days where the denominator is zero (no such days exist currently, but would on a zero-spend or zero-session day). avg_ctr and avg_cpc are simple averages across campaigns, not impression-weighted — they can be misleading when campaign sizes differ significantly. 
columns: - name: date - description: Date of the summary metrics + description: "[Business Purpose] Calendar date for the summary row. Sole grain column — each date appears exactly once. Use for time-series trending of portfolio-level KPIs." data_tests: - not_null - unique # Campaign metrics - name: active_campaigns - description: Number of distinct campaigns active on this date + description: "[Business Purpose] Count of distinct campaigns with data on this date. Tracks portfolio breadth — a sudden drop may indicate paused campaigns or source data issues. Currently constant at 20." - name: active_channels - description: Number of distinct channels with activity + description: "[Business Purpose] Count of distinct marketing channels active on this date. Measures platform diversification. Currently constant at 8." # Spend metrics - name: total_spend - description: Total amount spent across all campaigns in dollars + description: "[Business Purpose] Sum of spend across all campaigns for this date. Primary cost metric for portfolio-level budget monitoring. Range: ~$34k–$50k/day in current data." - name: total_budget - description: Total budget allocated across all campaigns + description: "[Business Purpose] Sum of daily_budget across all campaigns. Represents the theoretical maximum spend if all campaigns fully pace. Compare against total_spend via budget_utilization." - name: budget_utilization - description: Percentage of budget actually spent (spend / budget) + description: | + [Business Purpose] Ratio of total_spend to total_budget. Values near 1.0 mean campaigns are spending their full allocation. Low values suggest pacing issues or audience saturation. + [Known Issues / Caveats] Uses NULLIF(total_budget, 0) — returns NULL if total budget is zero on a given day. # Impression & Click metrics - name: total_impressions - description: Total ad impressions across all campaigns + description: "[Business Purpose] Sum of impressions across all campaigns. 
Top-of-funnel volume indicator for the entire portfolio." - name: total_clicks - description: Total clicks across all campaigns + description: "[Business Purpose] Sum of clicks across all campaigns. Aggregate demand signal from ad impressions." - name: avg_ctr - description: Average click-through rate across campaigns + description: | + [Business Purpose] Simple average of CTR across campaigns. Directional indicator of overall ad creative health. + [Known Issues / Caveats] This is an unweighted average — campaigns with 1,000 impressions count equally to campaigns with 200,000. For impression-weighted CTR, compute total_clicks / total_impressions instead. - name: avg_cpc - description: Average cost per click across campaigns + description: | + [Business Purpose] Simple average of CPC across campaigns. Directional indicator of auction cost trends. + [Known Issues / Caveats] Unweighted average — same caveat as avg_ctr. For spend-weighted CPC, compute total_spend / total_clicks. # Session metrics - name: total_sessions - description: Total website sessions from all campaigns + description: | + [Business Purpose] Sum of sessions across all campaigns. Measures total site traffic driven by marketing on this date. + [Known Issues / Caveats] Currently reads 2,200 every day (20 campaigns x 110 uniform sessions) — reflects synthetic source data. - name: total_users - description: Total unique users across all campaigns + description: | + [Business Purpose] Sum of unique_users across campaigns. Note: this is an additive sum, not a deduplicated user count — the same user visiting via two campaigns is counted twice. + [Known Issues / Caveats] Overstates true unique reach. For deduplicated counts, query stg_sessions directly. - name: avg_session_duration - description: Average session duration in seconds + description: "[Business Purpose] Average session duration in seconds across all campaigns. Proxy for overall content engagement quality driven by marketing traffic." 
- name: avg_pages_per_session - description: Average pages viewed per session + description: "[Business Purpose] Average pages per session across all campaigns. Measures depth of engagement for marketing-driven traffic." # Conversion metrics - name: total_conversions - description: Total conversions across all campaigns + description: | + [Business Purpose] Sum of conversions across all campaigns. Bottom-of-funnel portfolio outcome metric. + [Known Issues / Caveats] Currently constant at 150/day — reflects synthetic uniformity in conversion source data. - name: total_revenue - description: Total revenue from all conversions in dollars + description: "[Business Purpose] Sum of conversion revenue across all campaigns in dollars. Primary revenue metric for portfolio ROI. Range: ~$25.5k–$66.1k/day in current data." - name: avg_order_value - description: Average order value across all conversions + description: | + [Business Purpose] Simple average of per-campaign avg_order_value. Indicates the typical transaction size across the portfolio. + [Known Issues / Caveats] This averages campaign-level AOVs, including campaigns with zero conversions where AOV is COALESCE'd to 0 — pulling the average down. Filter campaign_performance to total_conversions > 0 before computing a meaningful portfolio AOV. # Calculated KPIs - name: overall_conversion_rate - description: Overall conversion rate (total conversions / total sessions) + description: | + [Business Purpose] Total conversions divided by total sessions across the portfolio. Measures aggregate funnel efficiency from visit to conversion. + [Known Issues / Caveats] Uses NULLIF(total_sessions, 0) — returns NULL if no sessions exist on a given day. - name: overall_roas - description: Overall return on ad spend (total revenue / total spend) + description: | + [Business Purpose] Total revenue divided by total spend across the portfolio. Values above 1.0 indicate the marketing program as a whole is revenue-positive. 
Current range: 0.54–1.62. + [Known Issues / Caveats] Uses NULLIF(total_spend, 0) — returns NULL on zero-spend days. - name: overall_cpa - description: Overall cost per acquisition (total spend / total conversions) - + description: | + [Business Purpose] Total spend divided by total conversions. Portfolio-level cost per acquisition. Compare against avg_order_value to assess whether acquisition cost is justified by transaction value. + [Known Issues / Caveats] Uses NULLIF(total_conversions, 0) — returns NULL on zero-conversion days.