Commit 4e7f910

feat: add metadata comments guidance to datagen skill (#292)

* feat: add table/column comments guidance to synthetic data gen skill

  Adds documentation for DDL-first and post-write approaches to set table and column comments when writing Delta tables to Unity Catalog.

* feat: add PySpark StructField metadata approach for column comments

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

1 parent 3023de7 commit 4e7f910

2 files changed: 72 additions & 6 deletions

databricks-skills/databricks-synthetic-data-gen/SKILL.md (14 additions & 4 deletions)
````diff
@@ -83,10 +83,19 @@ Show a clear specification with **YOUR ASSUMPTIONS surfaced**. Always start with
 Volume: /Volumes/{user_catalog}/ecommerce_demo/raw_data/
 ```

-| Table | Columns | Rows | Key Assumptions |
-|-------|---------|------|-----------------|
-| customers | customer_id, name, email, tier, region | 5,000 | Tier: Free 60%, Pro 30%, Enterprise 10% |
-| orders | order_id, customer_id (FK), amount, status | 15,000 | Enterprise customers generate 5x more orders |
+| Table | Columns | Description | Rows | Key Assumptions |
+|-------|---------|-------------|------|-----------------|
+| customers | customer_id, name, email, tier, region | Synthetic customer profiles | 5,000 | Tier: Free 60%, Pro 30%, Enterprise 10% |
+| orders | order_id, customer_id (FK), amount, status | Customer purchase transactions | 15,000 | Enterprise customers generate 5x more orders |
+
+Include column-level descriptions in the plan (these become column comments in Unity Catalog):
+
+| Table | Column | Comment |
+|-------|--------|---------|
+| customers | customer_id | Unique customer identifier (CUST-XXXXX) |
+| customers | tier | Customer tier: Free, Pro, Enterprise |
+| orders | customer_id | FK to customers.customer_id |
+| orders | amount | Order total in USD |

 **Assumptions I'm making:**
 - Amount distribution: log-normal by tier (Enterprise ~$1800, Pro ~$245, Free ~$55)
````
````diff
@@ -238,6 +247,7 @@ See [references/5-output-formats.md](references/5-output-formats.md) for detaile
 - Create infrastructure in script (`CREATE SCHEMA/VOLUME IF NOT EXISTS`)
 - Do NOT create catalogs - assume they exist
 - Delta tables as default
+- Add table and column comments for discoverability in Unity Catalog (see [references/5-output-formats.md](references/5-output-formats.md))

 ## Related Skills

````
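The column-comment table added to the plan above maps one-to-one onto `ALTER COLUMN ... COMMENT` statements. A minimal sketch of that mapping, not part of the commit itself: the `CATALOG`/`SCHEMA` values and the `comment_statements` helper are hypothetical placeholders.

```python
# Hypothetical sketch: derive ALTER COLUMN statements from the plan's
# column-comment table. CATALOG/SCHEMA are assumed placeholder values.
CATALOG, SCHEMA = "main", "ecommerce_demo"

# (table, column) -> comment, mirroring the plan table above
COLUMN_COMMENTS = {
    ("customers", "customer_id"): "Unique customer identifier (CUST-XXXXX)",
    ("customers", "tier"): "Customer tier: Free, Pro, Enterprise",
    ("orders", "customer_id"): "FK to customers.customer_id",
    ("orders", "amount"): "Order total in USD",
}

def comment_statements(comments):
    """Yield one ALTER TABLE ... ALTER COLUMN ... COMMENT statement per entry."""
    for (table, column), comment in comments.items():
        escaped = comment.replace("'", "''")  # escape single quotes for SQL
        yield (
            f"ALTER TABLE {CATALOG}.{SCHEMA}.{table} "
            f"ALTER COLUMN {column} COMMENT '{escaped}'"
        )

statements = list(comment_statements(COLUMN_COMMENTS))
# On a live cluster each statement would be run via spark.sql(statement).
```

Generating the statements from one dict keeps the plan document and the applied comments from drifting apart.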
databricks-skills/databricks-synthetic-data-gen/references/5-output-formats.md (58 additions & 2 deletions)
````diff
@@ -12,11 +12,11 @@ SCHEMA = "<user-provided-schema>"
 VOLUME_PATH = f"/Volumes/{CATALOG}/{SCHEMA}/raw_data"

 # Note: Assume catalog exists - do NOT create it
-spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
+spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA} COMMENT 'Synthetic data for demo scenario'")
 spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data")
 ```

-**Important:** Do NOT create catalogs - assume they already exist. Only create schema and volume.
+**Important:** Do NOT create catalogs - assume they already exist. Only create schema and volume. Always add a `COMMENT` to schemas describing the dataset purpose.

 ---

````
````diff
@@ -126,6 +126,62 @@ customers_df.write \
 - Skip the SDP bronze/silver/gold pipeline
 - Direct SQL analytics

+### Adding Table and Column Comments
+
+Always add comments to Delta tables for discoverability in Unity Catalog. Prefer the DDL-first approach: define the table with comments, then insert data.
+
+**DDL-first (preferred):**
+```python
+# Create table with inline column comments and table comment
+spark.sql(f"""
+CREATE TABLE IF NOT EXISTS {CATALOG}.{SCHEMA}.customers (
+    customer_id STRING COMMENT 'Unique customer identifier (CUST-XXXXX)',
+    name STRING COMMENT 'Full customer name',
+    email STRING COMMENT 'Customer email address',
+    tier STRING COMMENT 'Customer tier: Free, Pro, Enterprise',
+    region STRING COMMENT 'Geographic region',
+    arr DOUBLE COMMENT 'Annual recurring revenue in USD'
+)
+COMMENT 'Synthetic customer data for e-commerce demo'
+""")
+
+# Then write data into the pre-defined table
+customers_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.customers")
+```
+
+**PySpark schema with comments:**
+```python
+from pyspark.sql.types import StructType, StructField, StringType, DoubleType
+
+schema = StructType([
+    StructField("customer_id", StringType(), True, metadata={"comment": "Unique customer identifier (CUST-XXXXX)"}),
+    StructField("name", StringType(), True, metadata={"comment": "Full customer name"}),
+    StructField("email", StringType(), True, metadata={"comment": "Customer email address"}),
+    StructField("tier", StringType(), True, metadata={"comment": "Customer tier: Free, Pro, Enterprise"}),
+    StructField("region", StringType(), True, metadata={"comment": "Geographic region"}),
+    StructField("arr", DoubleType(), True, metadata={"comment": "Annual recurring revenue in USD"}),
+])
+
+# Apply schema when creating the DataFrame; comments persist when saved as Delta
+customers_df = spark.createDataFrame(data, schema)
+customers_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.customers")
+```
+
+**Post-write (alternative):**
+```python
+# Write first, then add comments
+customers_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.customers")
+
+# Add table comment
+spark.sql(f"COMMENT ON TABLE {CATALOG}.{SCHEMA}.customers IS 'Synthetic customer data for e-commerce demo'")
+
+# Add column comments
+spark.sql(f"ALTER TABLE {CATALOG}.{SCHEMA}.customers ALTER COLUMN customer_id COMMENT 'Unique customer identifier (CUST-XXXXX)'")
+spark.sql(f"ALTER TABLE {CATALOG}.{SCHEMA}.customers ALTER COLUMN tier COMMENT 'Customer tier: Free, Pro, Enterprise'")
+```
+
+**Note:** Column/table comments only apply to Delta tables in Unity Catalog. Parquet/JSON/CSV files written to volumes do not support metadata comments.
+
 ---

 ## Write Modes
````
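Once comments are set by any of the approaches in 5-output-formats.md, they can be read back from Unity Catalog to confirm they landed. A hedged sketch of the read-back queries (the `CATALOG`/`SCHEMA`/table names are assumed placeholders matching the examples; `DESCRIBE TABLE EXTENDED` and `information_schema.columns` are standard Unity Catalog surfaces):

```python
# Sketch of read-back queries to verify comments; CATALOG/SCHEMA/TABLE are
# assumed placeholders, not values from this commit.
CATALOG, SCHEMA, TABLE = "main", "ecommerce_demo", "customers"

# The table-level comment appears in the 'Comment' row of DESCRIBE TABLE EXTENDED.
table_check = f"DESCRIBE TABLE EXTENDED {CATALOG}.{SCHEMA}.{TABLE}"

# Column-level comments are queryable via the information schema.
column_check = (
    f"SELECT column_name, comment "
    f"FROM {CATALOG}.information_schema.columns "
    f"WHERE table_schema = '{SCHEMA}' AND table_name = '{TABLE}'"
)
# On a live cluster: spark.sql(table_check) / spark.sql(column_check)
```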
