### databricks-skills/databricks-synthetic-data-gen/SKILL.md (14 additions, 4 deletions)

Show a clear specification with **YOUR ASSUMPTIONS surfaced**. Always start with:

```
Volume: /Volumes/{user_catalog}/ecommerce_demo/raw_data/
```

| Table | Columns | Description | Rows | Key Assumptions |
|-------|---------|-------------|------|-----------------|
| customers | customer_id, name, email, tier, region | Synthetic customer profiles | 5,000 | Tier: Free 60%, Pro 30%, Enterprise 10% |
| orders | order_id, customer_id (FK), amount, status | Customer purchase transactions | 15,000 | Enterprise customers generate 5x more orders |
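The tier split and order-volume skew from the table above can be sketched as a weighted draw. A minimal sketch: the Free/Pro/Enterprise proportions and the Enterprise 5x factor come from the plan, while the Pro multiplier and the exact scaling are assumptions.

```python
import random

random.seed(42)

# Tier split from the plan: Free 60%, Pro 30%, Enterprise 10%
TIER_WEIGHTS = {"Free": 0.60, "Pro": 0.30, "Enterprise": 0.10}
# Enterprise generates 5x the baseline order volume; the Pro factor (2x) is assumed
ORDER_MULTIPLIER = {"Free": 1, "Pro": 2, "Enterprise": 5}

tiers = random.choices(
    population=list(TIER_WEIGHTS), weights=list(TIER_WEIGHTS.values()), k=5000
)

# Expected orders per customer, scaled so the total lands on ~15,000 orders
base = 15_000 / sum(ORDER_MULTIPLIER[t] for t in tiers)
orders_per_customer = [base * ORDER_MULTIPLIER[t] for t in tiers]
```

The per-customer expectations can then drive however many order rows are generated for each customer.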

Include column-level descriptions in the plan (these become column comments in Unity Catalog):

| Table | Column | Comment |
|-------|--------|---------|
| customers | customer_id | Unique customer identifier (CUST-XXXXX) |
| customers | tier | Customer tier: Free, Pro, Enterprise |
| orders | customer_id | FK to customers.customer_id |
| orders | amount | Order total in USD |
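The `customer_id` format documented above can be generated from a zero-padded sequence number. A sketch, assuming the `CUST-XXXXX` pattern means five zero-padded digits:

```python
def make_customer_id(n: int) -> str:
    """Format a sequence number as CUST-XXXXX (zero-padded to 5 digits, assumed)."""
    return f"CUST-{n:05d}"

# 5,000 customers per the plan, all with unique IDs
ids = [make_customer_id(i) for i in range(1, 5001)]
```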

**Assumptions I'm making:**
- Amount distribution: log-normal by tier (Enterprise ~$1800, Pro ~$245, Free ~$55)
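The log-normal amount assumption above can be sketched as follows. The per-tier means come from the plan; the spread parameter `SIGMA` is an assumption, since the plan only fixes the means.

```python
import numpy as np

rng = np.random.default_rng(7)

TIER_MEAN = {"Enterprise": 1800.0, "Pro": 245.0, "Free": 55.0}  # means from the plan
SIGMA = 0.6  # assumed log-scale spread; not specified in the plan

def sample_amounts(tier: str, n: int) -> np.ndarray:
    """Draw n log-normal order amounts whose expected value matches the tier mean."""
    # E[lognormal(mu, sigma)] = exp(mu + sigma^2 / 2), so solve for mu
    mu = np.log(TIER_MEAN[tier]) - SIGMA**2 / 2
    return np.round(rng.lognormal(mu, SIGMA, size=n), 2)
```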
---

See [references/5-output-formats.md](references/5-output-formats.md) for detailed output-format guidance.
- Create infrastructure in script (`CREATE SCHEMA/VOLUME IF NOT EXISTS`)
- Do NOT create catalogs - assume they exist
- Delta tables as default
- Add table and column comments for discoverability in Unity Catalog (see [references/5-output-formats.md](references/5-output-formats.md))

## Related Skills

---

```python
CATALOG = "<user-provided-catalog>"  # assumed placeholder, matching the SCHEMA convention
SCHEMA = "<user-provided-schema>"
VOLUME_PATH = f"/Volumes/{CATALOG}/{SCHEMA}/raw_data"

# Note: Assume catalog exists - do NOT create it
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA} COMMENT 'Synthetic data for demo scenario'")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data")
```

**Important:** Do NOT create catalogs - assume they already exist. Only create schema and volume. Always add a `COMMENT` to schemas describing the dataset purpose.

---

- Skip the SDP bronze/silver/gold pipeline
- Direct SQL analytics
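For the direct-SQL path, a query like the following works against the generated tables. A sketch only: the catalog and schema values are hypothetical placeholders, and the table/column names are taken from the plan above.

```python
CATALOG, SCHEMA = "main", "ecommerce_demo"  # hypothetical values for illustration

# Average order value per tier, joining the synthetic customers and orders tables
query = f"""
SELECT c.tier,
       COUNT(o.order_id)       AS orders,
       ROUND(AVG(o.amount), 2) AS avg_order_amount
FROM {CATALOG}.{SCHEMA}.customers c
JOIN {CATALOG}.{SCHEMA}.orders o
  ON o.customer_id = c.customer_id
GROUP BY c.tier
ORDER BY avg_order_amount DESC
"""
# In a Databricks notebook: spark.sql(query).show()
```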

### Adding Table and Column Comments

Always add comments to Delta tables for discoverability in Unity Catalog. Prefer a DDL-first approach: define the table with its comments, then insert the data.

**DDL-first (preferred):**
```python
# Create table with inline column comments and table comment
spark.sql(f"""
CREATE TABLE IF NOT EXISTS {CATALOG}.{SCHEMA}.customers (
customer_id STRING COMMENT 'Unique customer identifier (CUST-XXXXX)',
name STRING COMMENT 'Full customer name',
email STRING COMMENT 'Customer email address',
tier STRING COMMENT 'Customer tier: Free, Pro, Enterprise',
region STRING COMMENT 'Geographic region',
arr DOUBLE COMMENT 'Annual recurring revenue in USD'
)
COMMENT 'Synthetic customer data for e-commerce demo'
""")

# Then write data into the pre-defined table
customers_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.customers")
```

**PySpark schema with comments:**
```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
StructField("customer_id", StringType(), True, metadata={"comment": "Unique customer identifier (CUST-XXXXX)"}),
StructField("name", StringType(), True, metadata={"comment": "Full customer name"}),
StructField("email", StringType(), True, metadata={"comment": "Customer email address"}),
StructField("tier", StringType(), True, metadata={"comment": "Customer tier: Free, Pro, Enterprise"}),
StructField("region", StringType(), True, metadata={"comment": "Geographic region"}),
StructField("arr", DoubleType(), True, metadata={"comment": "Annual recurring revenue in USD"}),
])

# Apply the schema when creating the DataFrame; the comments persist when saved as a Delta table
customers_df = spark.createDataFrame(data, schema)
customers_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.customers")
```

**Post-write (alternative):**
```python
# Write first, then add comments
customers_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.customers")

# Add table comment
spark.sql(f"COMMENT ON TABLE {CATALOG}.{SCHEMA}.customers IS 'Synthetic customer data for e-commerce demo'")

# Add column comments
spark.sql(f"ALTER TABLE {CATALOG}.{SCHEMA}.customers ALTER COLUMN customer_id COMMENT 'Unique customer identifier (CUST-XXXXX)'")
spark.sql(f"ALTER TABLE {CATALOG}.{SCHEMA}.customers ALTER COLUMN tier COMMENT 'Customer tier: Free, Pro, Enterprise'")
```

**Note:** Column/table comments only apply to Delta tables in Unity Catalog. Parquet/JSON/CSV files written to volumes do not support metadata comments.

---

## Write Modes