Skip to content

amralieg/sap-model-decoder-agent

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SAP Model Decoder Agent

Translate cryptic SAP table and column names into business-friendly views, automatically

Databricks Serverless Unity Catalog LLM Powered

Point it at SAP tables in Unity Catalog. Get readable views your business users can actually query.


Overview · Quick Start · Widget Reference · How It Works · Output · Troubleshooting


Table of Contents


🚀 Overview

SAP source data lives behind table and column names like MARA, BUKRS, MANDT, VBELN. They are short, German-abbreviated, and unreadable to anyone who is not a seasoned SAP consultant. This agent uses a Large Language Model on top of Databricks to:

  1. Read SAP table and column names from Unity Catalog
  2. Translate them into clear, snake_case English business names
  3. Generate CREATE VIEW statements that expose the original tables under their new readable names
  4. Optionally add column-level comments and group views into domain-specific schemas

The result is a layer of business-friendly views over your raw SAP data — discoverable, queryable, and ready for Genie spaces, dashboards, and self-service AI/BI.


💡 Why This Exists

Most SAP customers face the same problem when they land their data in a lakehouse:

  • Analysts cannot find anything because every table and column is cryptic.
  • Documentation lives in PDFs, wikis, and tribal knowledge — not in the catalog.
  • Hand-writing readable views for hundreds of tables is tedious and error-prone.

This agent does it for you. One run over a catalog or a schema produces a parallel layer of readable views, with rich column descriptions, generated by an LLM that has been trained on SAP's data model.


🏁 Quick Start

  1. Import the notebook into your Databricks workspace.

    • Easiest path: clone this repo as a Databricks Repo, or upload agent/sap_model_decoder_agent.ipynb directly.
  2. Attach to Serverless compute. The agent uses Databricks Foundation Model APIs (ai_query) and runs entirely on Serverless SQL.

  3. Set the widgets at the top of the notebook (see Widget Reference).

  4. Run all. The agent processes tables in parallel and creates the views in your chosen destination.

  5. Validate. Open the new schema, browse the views, run a few SELECT * queries, point Genie at it.


🎛 Widget Reference

# Widget Description
1 UC Metadata Comma-separated list of catalogs, schemas, or tables to process. Detection is automatic by dot-count: 0 dots = catalog (all schemas + tables); 1 dot = schema (all tables); 2 dots = single table. Mix freely: prod.finance.bkpf, prod.sd, my_catalog.
2 View Generation Style same catalog writes views into the source catalog. different catalog writes to the catalog set in widget 3.
3 Destination Catalog Used only when widget 2 is set to different catalog. The target catalog where new views are created.
4 Include Source Name No (default), as suffix, or as prefix. Useful when multiple SAP sources land in the same lakehouse and you want to keep the source visible in the view name.
5 Generate Comments Yes to apply column-level COMMENT statements on each new view, with the LLM-generated business descriptions.
6 Group Views into Databases Yes clusters the new views into domain-specific schemas (FI, CO, SD, MM, etc.) inferred by the LLM. No keeps a flat layout.
7 Parallelisation (Batch Size) How many tables to decode and create concurrently. Default 25. Range 5100 in steps of 5. Tune up for big runs, down if your warehouse is small.

🔍 How It Works

┌──────────────────────────────────────────────────────────────────────────┐
│                          ① Discover                                      │
│  Walks the UC entries you provided and resolves them to a flat list of  │
│  source tables (catalog.schema.table).                                  │
└──────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                          ② Decode                                        │
│  For each table, calls the Foundation Model API with a structured       │
│  prompt that asks for: business view name, view description, and a      │
│  CSV decoding of every column (kept verbatim if already readable).      │
└──────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                          ③ (Optional) Group                              │
│  If "Group Views into Databases" is on, a second LLM call clusters      │
│  the new views into domain schemas (FI, CO, SD, MM, …).                 │
└──────────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                          ④ Apply                                         │
│  Generates and runs CREATE VIEW statements in parallel batches. If      │
│  "Generate Comments" is on, applies COMMENT ON COLUMN per column.       │
└──────────────────────────────────────────────────────────────────────────┘

The decoding prompt operates at the column level, not the table level. Cryptic columns get translated; columns that are already in clear English are kept verbatim. Same table can have a mix of both.


📦 Output

For every source table the agent produces:

  1. A view in the destination catalog/schema with a business-friendly name.
  2. A view-level description explaining the business purpose of the table.
  3. Column aliases that translate cryptic SAP names into readable English.
  4. Column-level comments (if generate_comments = Yes).
  5. Domain grouping (if group_views_into_dbs = Yes).

For example, prod.sap_raw.bkpf → view prod.sap_business.accounting_document_header with columns like client, company_code, accounting_document_number, created_on.


📚 Examples

Example 1 — Decode a single table into the same catalog

UC Metadata:           prod.sap_raw.bkpf
View Generation Style: same catalog
Generate Comments:     Yes

Result: prod.sap_raw.accounting_document_header view with column aliases and comments.

Example 2 — Decode a whole schema into a separate target catalog, grouped by domain

UC Metadata:           prod.sap_raw
View Generation Style: different catalog
Destination Catalog:   prod_business
Group Views into Dbs:  Yes
Generate Comments:     Yes
Parallelisation:       50

Result: every table under prod.sap_raw is decoded; the new views land in prod_business.<domain>.<view_name> (e.g. prod_business.fi.accounting_document_header, prod_business.sd.sales_order_header).

Example 3 — Multi-source mix

UC Metadata:           prod.sap_ecc, prod.sap_s4hana, prod.legacy.customer_data
Include Source Name:   as suffix
View Generation Style: different catalog
Destination Catalog:   prod_business

Result: views named customer_data_legacy, material_master_sap_ecc, etc., so the source lineage stays visible.


🛠 Requirements

  • Databricks workspace with Unity Catalog enabled
  • Access to Databricks Foundation Model APIs (the agent calls ai_query against a Foundation Model serving endpoint)
  • A Serverless SQL warehouse (the notebook runs on Serverless)
  • Permission to create views (and optionally schemas) in the destination catalog

Python dependencies are listed in requirements.txt and are pre-installed on standard Databricks runtimes.


🧰 Troubleshooting

Some columns were not decoded. By design. The agent keeps already-readable English column names verbatim; only cryptic SAP-style names are translated.

A view failed to create. Check the notebook output — the agent logs the failing SQL and the original Spark error. Most common causes: insufficient permissions on the destination catalog, or a column name that collides with an existing keyword. Re-run after fixing the underlying permission/naming issue.

The LLM returned an unexpected format. The agent expects a strict CSV response. If the model strays, the parsing helper (parse_csv_decode_response) reports the row that failed. Re-running usually fixes it; if it persists, lower the parallelism widget so each call is processed independently.

Domain grouping placed views in the wrong schema. The grouping pass is a single LLM call over the whole batch. For very large catalogs (1000s of tables) it can drift. Either disable grouping (group_views_into_dbs = No) and group manually, or run the agent over one source schema at a time.


📝 License

Apache License 2.0

About

SAP Model Decoder Agent — translate cryptic SAP table/column names into business-friendly Unity Catalog views using LLMs

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors