
feat: add schema anonymization mode #8

Open
wondr-wclabs wants to merge 1 commit into Cyberfilo:main from wondr-wclabs:codex/schema-anonymize


@wondr-wclabs

Closes #3.

This adds a --anonymize mode that keeps PromptQuery's local retrieval and safety model intact while avoiding sending real table and column names to LLMs.

I kept this as a separate SchemaAnonymizer layer instead of mixing it into format_schema() because there are three different responsibilities here:

  • retrieval still needs the real schema locally, otherwise user questions like "orders by customer" lose the lexical signal that makes TF-IDF useful
  • prompts sent to the selector/generator should receive opaque schema copies, so table names, column names, schema names, and table comments are not exposed
  • generated SQL needs to be mapped back before validate_select_only() and db.execute(), so the existing sqlglot safety guard and read-only DB session continue to operate on the real SQL
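The ordering of those three responsibilities can be sketched as below. This is a minimal illustration, not the code in this PR: `run_anonymized` and every `toy_*` helper are hypothetical names, the keyword match stands in for TF-IDF retrieval, and the string-replace restore is only a placeholder for the AST-based mapping described in the implementation notes.

```python
from typing import Callable


def run_anonymized(
    question: str,
    real_schema: dict,
    retrieve: Callable,      # ranks tables locally, using real names
    anonymize: Callable,     # returns (prompt_schema_text, restore_fn)
    generate_sql: Callable,  # the LLM call; only ever sees opaque tokens
    validate: Callable,      # safety guard, runs on real SQL
    execute: Callable,       # read-only DB call, runs on real SQL
):
    relevant = retrieve(question, real_schema)       # 1. real schema, local only
    prompt_schema, restore = anonymize(relevant)     # 2. opaque copy for the prompt
    opaque_sql = generate_sql(question, prompt_schema)
    real_sql = restore(opaque_sql)                   # 3. map back first...
    validate(real_sql)                               #    ...then validate
    return execute(real_sql)                         #    ...then execute


# --- toy stand-ins (all hypothetical), only to show the data flow ---
prompts_seen = []


def toy_retrieve(question, schema):
    # Keyword overlap as a stand-in for lexical retrieval on real names.
    words = set(question.lower().split())
    return {name: cols for name, cols in schema.items() if name in words}


def toy_anonymize(schema):
    tokens = {name: f"table_{i:03d}" for i, name in enumerate(schema, start=1)}

    def restore(sql):
        # Placeholder only; text replacement is not what the PR describes.
        for real, token in tokens.items():
            sql = sql.replace(token, real)
        return sql

    return "\n".join(tokens.values()), restore


def toy_generate(question, prompt_schema):
    prompts_seen.append(prompt_schema)
    return "SELECT * FROM table_001"


def toy_validate(sql):
    if ";" in sql:
        raise ValueError("multiple statements are not allowed")


result = run_anonymized(
    "orders by customer",
    {"orders": ["id"], "customers": ["id"]},
    toy_retrieve, toy_anonymize, toy_generate, toy_validate,
    lambda sql: sql,  # "execute" just echoes the SQL it was given
)
```

In this toy run the generator only receives `table_001`, while the validator and the executor both receive SQL that names `orders`.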

Implementation notes:

  • table tokens are deterministic (table_001, table_002, ...), based on the introspected schema order
  • column tokens are deterministic per table (column_001, column_002, ...), preserving column order
  • PK/NOT NULL markers, data types, and FK structure are preserved because the generator still needs relational shape to write plausible joins
  • table comments are omitted in anonymized prompts because comments often reintroduce the same business-specific names that the flag is meant to hide
  • reverse mapping uses sqlglot AST traversal rather than text replacement, and preserves multiple statements so the existing safety validator can still reject them
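A rough sketch of the token assignment and rendering rules in the notes above. The `Column` and `Table` dataclasses, the function names, and the rendered text layout are assumptions made for illustration; PromptQuery's actual schema model and prompt format may differ.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Column:
    name: str
    data_type: str
    primary_key: bool = False
    not_null: bool = False
    references: Optional[Tuple[str, str]] = None  # (real table, real column)


@dataclass
class Table:
    name: str
    columns: list
    comment: Optional[str] = None


def build_token_maps(tables):
    """Assign tokens by position, so the same schema order gives the same tokens."""
    table_tokens, column_tokens = {}, {}
    for t_idx, table in enumerate(tables, start=1):
        table_tokens[table.name] = f"table_{t_idx:03d}"
        column_tokens[table.name] = {
            col.name: f"column_{c_idx:03d}"
            for c_idx, col in enumerate(table.columns, start=1)
        }
    return table_tokens, column_tokens


def render_anonymized(tables):
    """Render the schema with opaque names; table comments are left out."""
    table_tokens, column_tokens = build_token_maps(tables)
    lines = []
    for table in tables:
        lines.append(f"{table_tokens[table.name]}:")
        for col in table.columns:
            parts = [column_tokens[table.name][col.name], col.data_type]
            if col.primary_key:
                parts.append("PK")
            if col.not_null:
                parts.append("NOT NULL")
            if col.references:
                ref_table, ref_col = col.references
                target = f"{table_tokens[ref_table]}.{column_tokens[ref_table][ref_col]}"
                parts.append(f"FK -> {target}")
            lines.append("  " + " ".join(parts))
    return "\n".join(lines)


schema = [
    Table("customers", [Column("id", "integer", True, True), Column("name", "text")],
          comment="CRM customers"),
    Table("orders", [
        Column("id", "integer", True, True),
        Column("customer_id", "integer", not_null=True, references=("customers", "id")),
    ]),
]
rendered = render_anonymized(schema)
```

Because tokens depend only on position, the foreign key from `orders.customer_id` to `customers.id` renders as a reference to `table_001.column_001`, which keeps the join shape visible without exposing either name.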

One deliberate boundary: this anonymizes schema identifiers sent by PromptQuery, not the user's natural-language question. It also preserves PostgreSQL type strings for query quality, so a custom type name could still be visible if the database uses semantic type names. I left that out of scope because the issue calls for table/column anonymisation and because replacing type information would make generated SQL materially worse.

Tests added/updated:

  • anonymized schema rendering hides real table/column names and comments while preserving PK/FK structure
  • generated SQL maps back across aliases and non-public schemas
  • multiple-statement generated SQL remains rejected after de-anonymisation
  • run_question() sends anonymized prompts to both generator and selector paths
  • CLI help exposes --anonymize

Validation:

  • .venv/bin/pytest -> 56 passed
  • git diff --check -> clean

@wondr-wclabs force-pushed the codex/schema-anonymize branch from 6979710 to 97d333b on June 5, 2026 19:52

Development

Successfully merging this pull request may close these issues.

Schema anonymisation mode (--anonymize)
