feat(dsql): enhance query plan explainability with type coercion detection, rewrites, and workflow extraction #162
base: main
Changes from all commits
f065f7a
4b0c990
e88e7a6
c640e19
aa2b4ec
48e707a
1178334
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,6 +1,6 @@ | ||
| --- | ||
| name: dsql | ||
| description: "Build with Aurora DSQL — manage schemas, execute queries, handle migrations, diagnose query plans, and develop applications with a serverless, distributed SQL database. Covers IAM auth, multi-tenant patterns, MySQL-to-DSQL migration, DDL operations, query plan explainability, and SQL compatibility validation. Triggers on phrases like: DSQL, Aurora DSQL, create DSQL table, DSQL schema, migrate to DSQL, distributed SQL database, serverless PostgreSQL-compatible database, DSQL query plan, DSQL EXPLAIN ANALYZE, why is my DSQL query slow." | ||
| description: "Build with Aurora DSQL — manage schemas, execute queries, handle migrations, diagnose query plans, and develop applications with a serverless, distributed SQL database. Covers IAM auth, multi-tenant patterns, MySQL-to-DSQL migration, DDL operations, query plan explainability, and SQL compatibility validation. Triggers on phrases like: DSQL, Aurora DSQL, create DSQL table, DSQL schema, migrate to DSQL, distributed SQL database, serverless PostgreSQL-compatible database, DSQL query plan, DSQL EXPLAIN ANALYZE, why is my DSQL query slow, DSQL query performance, DSQL full scan, DSQL DPU, DSQL query cost, DSQL latency, optimize this query, this query is slow, explain this plan, query performance, high DPU, make this faster, why is this doing a full scan." | ||
| license: Apache-2.0 | ||
| metadata: | ||
| tags: aws, aurora, dsql, distributed-sql, distributed, distributed-database, database, serverless, serverless-database, postgresql, postgres, sql, schema, migration, multi-tenant, iam-auth, aurora-dsql, mcp, orm | ||
|
|
@@ -35,7 +35,7 @@ Load these files as needed for detailed guidance: | |
|
|
||
| **When:** Always load for guidance using or updating the DSQL MCP server | ||
| **Contains:** Instructions for setting up the DSQL MCP server with 2 configuration options as | ||
| sampled in [.mcp.json](../../.mcp.json) | ||
| sampled in [mcp/.mcp.json](mcp/.mcp.json) | ||
|
|
||
| 1. Documentation-Tools Only | ||
| 2. Database Operations (requires a cluster endpoint) | ||
|
|
@@ -111,8 +111,10 @@ sampled in [.mcp.json](../../.mcp.json) | |
|
|
||
| ### Query Plan Explainability (modular): | ||
|
|
||
| **When:** MUST load all four at Workflow 8 Phase 0 — [query-plan/plan-interpretation.md](references/query-plan/plan-interpretation.md), [query-plan/catalog-queries.md](references/query-plan/catalog-queries.md), [query-plan/guc-experiments.md](references/query-plan/guc-experiments.md), [query-plan/report-format.md](references/query-plan/report-format.md) | ||
| **Contains:** DSQL node types + Node Duration math + estimation-error bands, pg_class/pg_stats/pg_indexes SQL + correlated-predicate verification, GUC experiment procedures + 30-second skip protocol, required report structure + element checklist + support request template | ||
| #### [query-plan/workflow.md](references/query-plan/workflow.md) | ||
|
|
||
| **When:** MUST load at Workflow 8 entry — it gates all other query-plan files | ||
| **Contains:** Trigger criteria, context disambiguation, routing, phased workflow (Phase 0–4). Workflow.md specifies which reference files to load at each phase — follow its loading instructions rather than loading all files upfront | ||
|
|
||
| ### SQL Compatibility Validation: | ||
|
|
||
|
|
@@ -164,16 +166,17 @@ defaults that may change — when a user's decision depends on an exact limit, v | |
| | Max indexes per table | 24 | `aurora dsql index limits` | | ||
| | Max columns per index | 8 | `aurora dsql index limits` | | ||
| | IDENTITY/SEQUENCE CACHE values | 1 or >= 65536 | `aurora dsql sequence cache` | | ||
| | Supported column data types | See docs | `aurora dsql supported data types` | | ||
|
|
||
| **When to verify:** Before recommending batch sizes, connection pool settings, or schema designs where hitting a limit would cause failures; any time the exact number can affect user decision. | ||
| **When to verify:** Before recommending batch sizes, connection pool settings, or schema designs | ||
| where hitting a limit would cause failures. No need to verify for general guidance or when | ||
|
Member
Unnecessary line breaks in the skill again? This is bad formatting practice; don't the extra lines consume additional tokens in skills? |
||
| the exact number doesn't affect the user's decision. | ||
|
|
||
| **Fallback:** If `awsknowledge` is unavailable, use the defaults above and flag that limits should be verified against [DSQL documentation](https://docs.aws.amazon.com/aurora-dsql/latest/userguide/). | ||
| **Fallback:** If `awsknowledge` is unavailable, MUST tell the user the lookup failed, MUST name the limit and its default value from the table above, and MUST link to [DSQL documentation](https://docs.aws.amazon.com/aurora-dsql/latest/userguide/) for verification. When the recommendation depends on the exact value (e.g., batch size at the 3,000 row boundary), MUST refuse the fallback and require the user to verify the limit manually. | ||
|
|
||
| ## CLI Scripts Available | ||
|
|
||
| Bash scripts in [scripts/](../../scripts/) for cluster management (create, delete, list, cluster info), psql connection, and bulk data loading from local/s3 csv/tsv/parquet files. | ||
| See [scripts/README.md](../../scripts/README.md) for usage and hook configuration. | ||
| Bash scripts in [scripts/](scripts/) for cluster management (create, delete, list, cluster info), psql connection, and bulk data loading from local/s3 csv/tsv/parquet files. | ||
| See [scripts/README.md](scripts/README.md) for usage. | ||
|
|
||
| --- | ||
|
|
||
|
|
@@ -197,7 +200,7 @@ See [scripts/README.md](../../scripts/README.md) for usage and hook configuratio | |
| - MUST include tenant_id in all tables | ||
| - MUST use `CREATE INDEX ASYNC` exclusively | ||
| - MUST issue each DDL in its own transact call: `transact(["CREATE TABLE ..."])` | ||
| - MUST serialize arrays as TEXT or JSON; cast back at query time (`string_to_array(text, ',')` or `jsonb_array_elements_text(json::jsonb)`) | ||
| - MUST store arrays/JSON as TEXT | ||
|
|
||
| ### Workflow 2: Safe Data Migration | ||
|
|
||
|
|
@@ -215,7 +218,10 @@ Every DDL statement generated in this workflow MUST be validated with `dsql_lint | |
| - MUST batch updates under 3,000 rows in separate transact calls | ||
| - MUST issue each ALTER TABLE in its own transaction | ||
|
|
||
| **Recovery — batch fails midway:** Rows already updated keep their new value (each batch committed independently). Resume by filtering on the unset state (`WHERE new_column IS NULL`) and continue. Re-running is safe because the filter naturally excludes completed rows. | ||
| **Recovery — batch fails midway:** Rows already updated keep their new value (each batch committed | ||
| in its own transaction). Resume by filtering on the unset state — e.g. add | ||
| `WHERE new_column IS NULL` (or the sentinel value) to the next UPDATE — and continue from there. | ||
| Re-running the entire migration is safe because the filter naturally excludes completed rows. | ||
|
Comment on lines +221 to +224
Member
LOC bloating here again? |
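The recovery rule discussed in this hunk (each batch commits on its own, and a resume filters on `WHERE new_column IS NULL`) can be sketched as a resumable backfill loop. This is a minimal illustration, not code from the PR: it uses Python's built-in `sqlite3` as a local stand-in for a DSQL cluster and its `transact` calls, and the `items` table, the `'backfilled'` value, the `fail_on_batch` switch, and the helper names are all hypothetical.

```python
import sqlite3

BATCH_SIZE = 2_999  # stays under the 3,000-row batch limit named in the workflow


def make_demo_db(rows: int) -> sqlite3.Connection:
    """Create an in-memory table whose new_column is still NULL everywhere."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, new_column TEXT)")
    conn.executemany("INSERT INTO items (id) VALUES (?)", [(i,) for i in range(rows)])
    conn.commit()
    return conn


def backfill(conn, batch_size=BATCH_SIZE, fail_on_batch=None) -> int:
    """Fill new_column batch by batch, committing each batch independently.

    Returns the number of rows updated by this call. fail_on_batch is a
    hypothetical test hook that simulates a crash before the given batch.
    """
    total = 0
    batch_no = 0
    while True:
        batch_no += 1
        if batch_no == fail_on_batch:
            raise RuntimeError("simulated mid-migration failure")
        cur = conn.execute(
            "UPDATE items SET new_column = 'backfilled' "
            "WHERE id IN (SELECT id FROM items "
            "             WHERE new_column IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # one transaction per batch, so finished batches survive a failure
        if cur.rowcount == 0:
            return total  # nothing left in the unset state
        total += cur.rowcount


def remaining(conn) -> int:
    """Count rows that still need the backfill."""
    return conn.execute(
        "SELECT COUNT(*) FROM items WHERE new_column IS NULL"
    ).fetchone()[0]


if __name__ == "__main__":
    conn = make_demo_db(10)
    try:
        backfill(conn, batch_size=4, fail_on_batch=2)  # first batch commits, then it fails
    except RuntimeError:
        pass
    print("rows left after failure:", remaining(conn))
    print("rows updated on resume:", backfill(conn, batch_size=4))
```

Because the filter selects only rows still in the unset state, calling `backfill` again after a failure picks up exactly where the committed batches stopped, which is why re-running the whole migration is safe.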
||
|
|
||
| ### Workflow 3: Application-Layer Referential Integrity | ||
|
|
||
|
|
@@ -252,31 +258,7 @@ Run `dsql_lint(sql=source_sql, fix=true)` to validate and auto-convert PostgreSQ | |
|
|
||
| ### Workflow 8: Query Plan Explainability | ||
|
|
||
| Explains why the DSQL optimizer chose a particular plan. Triggered by slow queries, high DPU, unexpected Full Scans, or plans the user doesn't understand. **REQUIRES a structured Markdown diagnostic report is the deliverable** beyond conversation — run the workflow end-to-end before answering. Use the `aurora-dsql` MCP when connected; fall back to raw `psql` with a generated IAM token (see the fallback block below) otherwise. | ||
|
|
||
| **Phase 0 — Load reference material.** Read all four before starting — each has content later phases need verbatim (node-type math, exact catalog SQL, the `>30s` skip protocol, required report elements): | ||
|
|
||
| 1. [query-plan/plan-interpretation.md](references/query-plan/plan-interpretation.md) — node types, duration math, anomalous values | ||
| 2. [query-plan/catalog-queries.md](references/query-plan/catalog-queries.md) — pg_class / pg_stats / pg_indexes SQL | ||
| 3. [query-plan/guc-experiments.md](references/query-plan/guc-experiments.md) — GUC procedures and `>30s` skip protocol | ||
| 4. [query-plan/report-format.md](references/query-plan/report-format.md) — required report structure | ||
|
|
||
| **Phase 1 — Capture the plan.** **ALWAYS** run `readonly_query("EXPLAIN ANALYZE VERBOSE …")` on the user's query verbatim (SELECT form) — **ALWAYS** capture a fresh plan from the cluster, even when the user describes the plan or reports an anomaly. **MAY** leverage `get_schema` or `information_schema` for schema sanity checks. When EXPLAIN errors (`relation does not exist`, `column does not exist`), **MUST** report the error verbatim — **MUST NOT** invent DSQL-specific semantics (e.g., case sensitivity, identifier quoting) as the root cause. Extract Query ID, Planning Time, Execution Time, DPU Estimate. **SELECT** runs as-is. **UPDATE/DELETE** rewrite to the equivalent SELECT (same join chain + WHERE) — the optimizer picks the same plan shape. **INSERT**, pl/pgsql, DO blocks, and functions **MUST** be rejected. **MUST NOT** use `transact --allow-writes` for plan capture; it bypasses MCP safety. | ||
|
|
||
| **Phase 2 — Gather evidence.** Using SQL from `catalog-queries.md`, query `pg_class`, `pg_stats`, `pg_indexes`, `COUNT(*)`, `COUNT(DISTINCT)`. Classify estimation errors per `plan-interpretation.md` (2x–5x minor, 5x–50x significant, 50x+ severe). Detect correlated predicates and data skew. | ||
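The estimation-error bands named in Phase 2 (2x–5x minor, 5x–50x significant, 50x+ severe) can be sketched as a small classifier. This is a minimal illustration, not code from the PR or from `plan-interpretation.md`. The function name, the symmetric ratio (over- and under-estimates treated alike), the clamp to one row, the `"within-2x"` label, and the choice to place exact boundary values in the higher band are all assumptions.

```python
def classify_estimation_error(estimated_rows: float, actual_rows: float) -> str:
    """Classify a planner row-count misestimate into the Phase 2 bands."""
    # Assumption: clamp both sides to 1 so zero-row estimates or results
    # do not cause a division by zero.
    est = max(float(estimated_rows), 1.0)
    act = max(float(actual_rows), 1.0)

    # Assumption: the error factor is symmetric, so a 10x overestimate and
    # a 10x underestimate land in the same band.
    ratio = max(est / act, act / est)

    if ratio >= 50:
        return "severe"
    if ratio >= 5:
        return "significant"
    if ratio >= 2:
        return "minor"
    return "within-2x"  # hypothetical label for errors below the smallest band


if __name__ == "__main__":
    # Estimated vs. actual row counts, as they might be read off plan nodes.
    for est, act in [(100, 100), (100, 300), (1000, 100), (10, 1000)]:
        print(est, act, classify_estimation_error(est, act))
```

Reading the estimated and actual row counts for each node from the `EXPLAIN ANALYZE VERBOSE` output and passing them through a function like this gives a consistent label to report alongside the catalog evidence.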
|
|
||
| **Phase 3 — Experiment (conditional).** ≤30s: run GUC experiments per `guc-experiments.md` (default + merge-join-only) plus optional redundant-predicate test. >30s: skip experiments, include the manual GUC testing SQL verbatim in the report, and do not re-run for redundant-predicate testing. Anomalous values (impossible row counts): confirm query results are correct despite the anomalous EXPLAIN, flag as a potential DSQL bug, and produce the Support Request Template from `report-format.md`. | ||
|
|
||
| **Phase 4 — Produce the report, invite reassessment.** Produce the full diagnostic report per the "Required Elements Checklist" in [query-plan/report-format.md](references/query-plan/report-format.md) — structure is non-negotiable. End with the "Next Steps" block from that reference so the user can ask for a reassessment after applying a recommendation. When the user says "reassess" (or equivalent), re-run Phase 1–2 and **append an "Addendum: After-Change Performance"** to the original report (before/after table, match against expected impact) rather than producing a new report. | ||
|
|
||
| **psql fallback (MCP unavailable).** Pipe statements into `psql` via heredoc and check `$?`; report failures without proceeding on partial evidence: | ||
|
|
||
| ```bash | ||
| TOKEN=$(aws dsql generate-db-connect-admin-auth-token --hostname "$HOST" --region "$REGION") | ||
| PGPASSWORD="$TOKEN" psql "host=$HOST port=5432 user=admin dbname=postgres sslmode=require" <<<"EXPLAIN ANALYZE VERBOSE <sql>;" | ||
| ``` | ||
|
|
||
| **Safety.** Plan capture uses `readonly_query` exclusively — it rejects INSERT/UPDATE/DELETE/DDL at the MCP layer. Rewrite DML to SELECT (Phase 1) rather than asking `transact --allow-writes` to run it; write-mode `transact` bypasses all MCP safety checks. **MUST NOT** run arbitrary DDL/DML or pl/pgsql. | ||
| Explains why the DSQL optimizer chose a particular plan. **REQUIRES a structured Markdown diagnostic report as the deliverable.** MUST load [query-plan/workflow.md](references/query-plan/workflow.md) for trigger criteria, context disambiguation, routing, and the full phased workflow (Phase 0–4). | ||
|
|
||
| --- | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| # Query Rewrites — DSQL-Specific | ||
|
|
||
| SQL rewrites that address Aurora DSQL-specific behaviors and optimizer constraints. These SHOULD be recommended when the plan reveals inefficiency unique to DSQL's distributed architecture. | ||
|
|
||
| ## Available Rewrites | ||
|
|
||
| | Pattern Detected | Reference File | | ||
| | ------------------------------- | ------------------------------------------------------------- | | ||
| | COUNT(*) timeout on large table | [reltuples-estimate.md](query-rewrites/reltuples-estimate.md) | | ||
| | Join count exceeds DP threshold | [split-large-joins.md](query-rewrites/split-large-joins.md) | |
rephrase to be prescriptive rather than avoidant prohibition?