Point it at SAP tables in Unity Catalog. Get readable views your business users can actually query.
Overview · Quick Start · Widget Reference · How It Works · Output · Troubleshooting
- Overview
- Why This Exists
- Quick Start
- Widget Reference
- How It Works
- Output
- Examples
- Requirements
- Troubleshooting
- License
SAP source data lives behind table and column names like MARA, BUKRS, MANDT, VBELN. They are short, German-abbreviated, and unreadable to anyone who is not a seasoned SAP consultant. This agent uses a Large Language Model on top of Databricks to:
- Read SAP table and column names from Unity Catalog
- Translate them into clear, snake_case English business names
- Generate
CREATE VIEWstatements that expose the original tables under their new readable names - Optionally add column-level comments and group views into domain-specific schemas
The result is a layer of business-friendly views over your raw SAP data — discoverable, queryable, and ready for Genie spaces, dashboards, and self-service AI/BI.
Most SAP customers face the same problem when they land their data in a lakehouse:
- Analysts cannot find anything because every table and column is cryptic.
- Documentation lives in PDFs, wikis, and tribal knowledge — not in the catalog.
- Hand-writing readable views for hundreds of tables is tedious and error-prone.
This agent does it for you. One run over a catalog or a schema produces a parallel layer of readable views, with rich column descriptions, generated by an LLM that has been trained on SAP's data model.
-
Import the notebook into your Databricks workspace.
- Easiest path: clone this repo as a Databricks Repo, or upload
agent/sap_model_decoder_agent.ipynbdirectly.
- Easiest path: clone this repo as a Databricks Repo, or upload
-
Attach to Serverless compute. The agent uses Databricks Foundation Model APIs (
ai_query) and runs entirely on Serverless SQL. -
Set the widgets at the top of the notebook (see Widget Reference).
-
Run all. The agent processes tables in parallel and creates the views in your chosen destination.
-
Validate. Open the new schema, browse the views, run a few
SELECT *queries, point Genie at it.
| # | Widget | Description |
|---|---|---|
| 1 | UC Metadata | Comma-separated list of catalogs, schemas, or tables to process. Detection is automatic by dot-count: 0 dots = catalog (all schemas + tables); 1 dot = schema (all tables); 2 dots = single table. Mix freely: prod.finance.bkpf, prod.sd, my_catalog. |
| 2 | View Generation Style | same catalog writes views into the source catalog. different catalog writes to the catalog set in widget 3. |
| 3 | Destination Catalog | Used only when widget 2 is set to different catalog. The target catalog where new views are created. |
| 4 | Include Source Name | No (default), as suffix, or as prefix. Useful when multiple SAP sources land in the same lakehouse and you want to keep the source visible in the view name. |
| 5 | Generate Comments | Yes to apply column-level COMMENT statements on each new view, with the LLM-generated business descriptions. |
| 6 | Group Views into Databases | Yes clusters the new views into domain-specific schemas (FI, CO, SD, MM, etc.) inferred by the LLM. No keeps a flat layout. |
| 7 | Parallelisation (Batch Size) | How many tables to decode and create concurrently. Default 25. Range 5–100 in steps of 5. Tune up for big runs, down if your warehouse is small. |
┌──────────────────────────────────────────────────────────────────────────┐
│ ① Discover │
│ Walks the UC entries you provided and resolves them to a flat list of │
│ source tables (catalog.schema.table). │
└──────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ ② Decode │
│ For each table, calls the Foundation Model API with a structured │
│ prompt that asks for: business view name, view description, and a │
│ CSV decoding of every column (kept verbatim if already readable). │
└──────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ ③ (Optional) Group │
│ If "Group Views into Databases" is on, a second LLM call clusters │
│ the new views into domain schemas (FI, CO, SD, MM, …). │
└──────────────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ ④ Apply │
│ Generates and runs CREATE VIEW statements in parallel batches. If │
│ "Generate Comments" is on, applies COMMENT ON COLUMN per column. │
└──────────────────────────────────────────────────────────────────────────┘
The decoding prompt operates at the column level, not the table level. Cryptic columns get translated; columns that are already in clear English are kept verbatim. Same table can have a mix of both.
For every source table the agent produces:
- A view in the destination catalog/schema with a business-friendly name.
- A view-level description explaining the business purpose of the table.
- Column aliases that translate cryptic SAP names into readable English.
- Column-level comments (if
generate_comments = Yes). - Domain grouping (if
group_views_into_dbs = Yes).
For example, prod.sap_raw.bkpf → view prod.sap_business.accounting_document_header with columns like client, company_code, accounting_document_number, created_on.
UC Metadata: prod.sap_raw.bkpf
View Generation Style: same catalog
Generate Comments: Yes
Result: prod.sap_raw.accounting_document_header view with column aliases and comments.
UC Metadata: prod.sap_raw
View Generation Style: different catalog
Destination Catalog: prod_business
Group Views into Dbs: Yes
Generate Comments: Yes
Parallelisation: 50
Result: every table under prod.sap_raw is decoded; the new views land in prod_business.<domain>.<view_name> (e.g. prod_business.fi.accounting_document_header, prod_business.sd.sales_order_header).
UC Metadata: prod.sap_ecc, prod.sap_s4hana, prod.legacy.customer_data
Include Source Name: as suffix
View Generation Style: different catalog
Destination Catalog: prod_business
Result: views named customer_data_legacy, material_master_sap_ecc, etc., so the source lineage stays visible.
- Databricks workspace with Unity Catalog enabled
- Access to Databricks Foundation Model APIs (the agent calls
ai_queryagainst a Foundation Model serving endpoint) - A Serverless SQL warehouse (the notebook runs on Serverless)
- Permission to create views (and optionally schemas) in the destination catalog
Python dependencies are listed in requirements.txt and are pre-installed on standard Databricks runtimes.
Some columns were not decoded. By design. The agent keeps already-readable English column names verbatim; only cryptic SAP-style names are translated.
A view failed to create. Check the notebook output — the agent logs the failing SQL and the original Spark error. Most common causes: insufficient permissions on the destination catalog, or a column name that collides with an existing keyword. Re-run after fixing the underlying permission/naming issue.
The LLM returned an unexpected format.
The agent expects a strict CSV response. If the model strays, the parsing helper (parse_csv_decode_response) reports the row that failed. Re-running usually fixes it; if it persists, lower the parallelism widget so each call is processed independently.
Domain grouping placed views in the wrong schema.
The grouping pass is a single LLM call over the whole batch. For very large catalogs (1000s of tables) it can drift. Either disable grouping (group_views_into_dbs = No) and group manually, or run the agent over one source schema at a time.