From 8b9831f590949b1e85f8fe5a4b3f57edc4407df2 Mon Sep 17 00:00:00 2001
From: Calvin Smithu <email@cjsmith.io>
Date: Tue, 17 Mar 2026 12:20:38 -0600
Subject: [PATCH 1/4] Add spark-version-upgrade skill with references and
 marketplace registration
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- SKILL.md: 6-phase upgrade workflow (inventory, build, API, config, SQL, test)
  covering Spark 2.x→3.x and 3.x→4.x with actionable checklists
- README.md: human-readable overview with trigger keywords and examples
- references/api-changes.md: removed/deprecated API catalog with before/after code
- references/config-changes.md: config property rename/removal mapping
- Registered in marketplaces/large-codebase.json

Co-authored-by: openhands <openhands@all-hands.dev>
---
 marketplaces/large-codebase.json              |  14 ++
 skills/spark-version-upgrade/README.md        |  55 +++++
 skills/spark-version-upgrade/SKILL.md         | 231 ++++++++++++++++++
 .../references/api-changes.md                 | 152 ++++++++++++
 .../references/config-changes.md              |  78 ++++++
 5 files changed, 530 insertions(+)
 create mode 100644 skills/spark-version-upgrade/README.md
 create mode 100644 skills/spark-version-upgrade/SKILL.md
 create mode 100644 skills/spark-version-upgrade/references/api-changes.md
 create mode 100644 skills/spark-version-upgrade/references/config-changes.md

diff --git a/marketplaces/large-codebase.json b/marketplaces/large-codebase.json
index 8083923..3e07222 100644
--- a/marketplaces/large-codebase.json
+++ b/marketplaces/large-codebase.json
@@ -47,6 +47,20 @@
         "evaluation",
         "reporting"
       ]
+    },
+    {
+      "name": "spark-version-upgrade",
+      "source": "./spark-version-upgrade",
+      "description": "Upgrade Apache Spark applications between major versions (2.x→3.x, 3.x→4.x). Covers build files, deprecated APIs, configuration changes, SQL/DataFrame updates, and test validation.",
+      "category": "development",
+      "keywords": [
+        "spark",
+        "upgrade",
+        "migration",
+        "pyspark",
+        "scala",
+        "big-data"
+      ]
     }
   ]
 }
diff --git a/skills/spark-version-upgrade/README.md b/skills/spark-version-upgrade/README.md
new file mode 100644
index 0000000..1e0d6bb
--- /dev/null
+++ b/skills/spark-version-upgrade/README.md
@@ -0,0 +1,55 @@
+# spark-version-upgrade
+
+Upgrade Apache Spark applications between major versions (2.x→3.x, 3.x→4.x). Covers build files, deprecated APIs, configuration changes, SQL/DataFrame updates, and test validation.
+
+## Triggers
+
+This skill is activated by the following keywords:
+
+- `spark upgrade`
+- `spark migration`
+- `spark version`
+- `upgrade spark`
+- `spark 3`
+- `spark 4`
+- `pyspark upgrade`
+
+## Overview
+
+This skill provides a structured, six-phase workflow for upgrading Apache Spark applications:
+
+| Phase | Description |
+|-------|-------------|
+| 1. Inventory & Impact Analysis | Scan the codebase, identify Spark usage, document scope |
+| 2. Build File Updates | Bump Spark/Scala/Java versions in Maven, SBT, Gradle, or pip |
+| 3. API Migration | Replace removed/deprecated APIs (SQLContext, Accumulator, etc.) |
+| 4. Configuration Migration | Rename/remove deprecated Spark config properties |
+| 5. SQL & DataFrame Migration | Fix breaking SQL behavior (ANSI mode, type coercion, date parsing) |
+| 6. Test Validation | Compile, test, compare output to pre-upgrade baseline |
+
+## Supported Upgrade Paths
+
+- **Spark 2.x → 3.x** — Major API removals (SQLContext, HiveContext, Accumulator v1), Scala 2.12/2.13
+- **Spark 3.x → 4.x** — ANSI mode default, Java 17+ requirement, Scala 2.13 only, legacy flag removal
+
+## Languages & Build Systems
+
+- **Languages**: Scala, Java, Python (PySpark)
+- **Build systems**: Maven, SBT, Gradle, pip/uv
+
+## Reference Material
+
+- [references/api-changes.md](references/api-changes.md) — Catalog of removed/deprecated APIs with before/after code
+- [references/config-changes.md](references/config-changes.md) — Spark configuration property rename/removal mapping
+
+## Example Usage
+
+Ask the agent:
+
+> "Upgrade this project from Spark 2.4 to Spark 3.5"
+
+> "Migrate our PySpark codebase to Spark 4.0"
+
+> "Fix all Spark deprecation warnings in this repo"
+
+The agent will follow the six-phase workflow, producing a `spark_upgrade_impact.md` document and systematically updating build files, code, configuration, and SQL queries.
diff --git a/skills/spark-version-upgrade/SKILL.md b/skills/spark-version-upgrade/SKILL.md
new file mode 100644
index 0000000..fe1ca87
--- /dev/null
+++ b/skills/spark-version-upgrade/SKILL.md
@@ -0,0 +1,231 @@
+---
+name: spark-version-upgrade
+description: Upgrade Apache Spark applications between major versions (2.x→3.x, 3.x→4.x). Covers build files, deprecated APIs, configuration changes, SQL/DataFrame updates, and test validation.
+license: MIT
+compatibility: Requires Java 8, 11, or 17+ and Scala 2.12 or 2.13 (depending on the target Spark version), plus Maven, Gradle, or SBT
+triggers:
+  - spark upgrade
+  - spark migration
+  - spark version
+  - upgrade spark
+  - spark 3
+  - spark 4
+  - pyspark upgrade
+---
+
+Upgrade Apache Spark applications between major versions with a structured, phase-by-phase workflow.
+
+## When to Use
+
+- Migrating from Spark 2.x → 3.x or Spark 3.x → 4.x
+- Updating PySpark, Spark SQL, or Structured Streaming applications
+- Resolving deprecation warnings before a Spark version bump
+
+## Workflow Overview
+
+1. **Inventory & Impact Analysis** — Scan the codebase and assess scope
+2. **Build File Updates** — Bump Spark/Scala/Java dependencies
+3. **API Migration** — Replace deprecated and removed APIs
+4. **Configuration Migration** — Update Spark config properties
+5. **SQL & DataFrame Migration** — Fix query-level breaking changes
+6. **Test Validation** — Compile, run tests, verify results
+
+---
+
+## Phase 1: Inventory & Impact Analysis
+
+Before changing any code, assess what needs to change.
+
+### Checklist
+
+- [ ] Identify current Spark version (check `pom.xml`, `build.sbt`, `build.gradle`, or `requirements.txt`)
+- [ ] Identify target Spark version
+- [ ] Locate Spark API usage as a starting point for finding deprecated APIs: `grep -rnE 'import org\.apache\.spark|from pyspark|import pyspark' --include='*.scala' --include='*.java' --include='*.py'`
+- [ ] List all Spark config properties: `grep -rn 'spark\.' --include='*.conf' --include='*.properties' --include='*.scala' --include='*.java' --include='*.py' | grep -v 'test'`
+- [ ] Check for custom `SparkSession` or `SparkContext` extensions
+- [ ] Identify connector dependencies (Hive, Kafka, Cassandra, Delta, Iceberg)
+- [ ] Document findings in `spark_upgrade_impact.md`
+
+### Output
+
+```
+spark_upgrade_impact.md   # Summary of affected files, APIs, and configs
+```
+
+---
+
+## Phase 2: Build File Updates
+
+Update dependency versions and resolve compilation.
+
+### Maven (`pom.xml`)
+
+```xml
+<!-- Update Spark version property -->
+<spark.version>3.5.1</spark.version>    <!-- or 4.0.0 -->
+<scala.version>2.13.12</scala.version>  <!-- Spark 3.x: 2.12/2.13; Spark 4.x: 2.13 -->
+
+<!-- Update artifact IDs if Scala cross-version changed -->
+<artifactId>spark-core_2.13</artifactId>
+<artifactId>spark-sql_2.13</artifactId>
+```
+
+### SBT (`build.sbt`)
+
+```scala
+val sparkVersion = "3.5.1" // or "4.0.0"
+scalaVersion := "2.13.12"
+
+libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion
+libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion
+```
+
+### Gradle (`build.gradle`)
+
+```groovy
+ext {
+    sparkVersion = '3.5.1' // or '4.0.0'
+}
+dependencies {
+    implementation "org.apache.spark:spark-core_2.13:${sparkVersion}"
+    implementation "org.apache.spark:spark-sql_2.13:${sparkVersion}"
+}
+```
+
+### PySpark (`requirements.txt` / `pyproject.toml`)
+
+```
+pyspark==3.5.1   # or 4.0.0
+```
+
+### Checklist
+
+- [ ] Update Spark version in build file
+- [ ] Update Scala version if crossing 2.12→2.13 boundary
+- [ ] Update Java source/target level if required (Spark 4.x requires Java 17+)
+- [ ] Update connector library versions to match new Spark version
+- [ ] Resolve dependency conflicts (`mvn dependency:tree` / `sbt dependencyTree`)
+- [ ] Attempt a full compile and record the errors (failures at this stage are expected; they guide Phase 3)
+
+---
+
+## Phase 3: API Migration
+
+Replace removed and deprecated APIs. Work through compiler errors systematically.
+
+### Common Patterns
+
+See [references/api-changes.md](references/api-changes.md) for a full catalog.
+
+#### SparkSession Creation (2.x → 3.x)
+
+```scala
+// BEFORE (Spark 1.x/2.x)
+val sc = new SparkContext(conf)
+val sqlContext = new SQLContext(sc)
+
+// AFTER (Spark 2.x+/3.x)
+val spark = SparkSession.builder()
+  .config(conf)
+  .enableHiveSupport() // if needed
+  .getOrCreate()
+val sc = spark.sparkContext
+```
+
+#### RDD to DataFrame (2.x → 3.x)
+
+```scala
+// BEFORE
+rdd.toDF()  // implicit from SQLContext
+
+// AFTER
+import spark.implicits._
+rdd.toDF()  // implicit from SparkSession
+```
+
+#### Accumulator API (2.x → 3.x)
+
+```scala
+// BEFORE
+val acc = sc.accumulator(0)
+
+// AFTER
+val acc = sc.longAccumulator("name")
+```
+
+### Checklist
+
+- [ ] Replace `SQLContext` / `HiveContext` with `SparkSession`
+- [ ] Replace the legacy `Accumulator` (v1) API with `AccumulatorV2` (for example, `longAccumulator`)
+- [ ] Update `DataFrame` → `Dataset[Row]` where needed
+- [ ] Replace removed `RDD.mapPartitionsWithContext` with `mapPartitions`
+- [ ] Fix `SparkConf` deprecated setters
+- [ ] Update custom `UserDefinedFunction` registration
+- [ ] Migrate `Experimental` / `DeveloperApi` usages that were removed
+- [ ] Verify all compilation errors from Phase 2 are resolved
+
+---
+
+## Phase 4: Configuration Migration
+
+Spark renames and removes configuration properties between versions.
+
+See [references/config-changes.md](references/config-changes.md) for the full mapping.
+
+### Checklist
+
+- [ ] Rename deprecated config keys (e.g., `spark.shuffle.file.buffer.kb` → `spark.shuffle.file.buffer`)
+- [ ] Update removed configs to their replacements
+- [ ] Review `spark-defaults.conf`, application code, and submit scripts
+- [ ] Check for hardcoded config values in test fixtures
+- [ ] Verify `SparkSession.builder().config(...)` calls use current property names
+
+---
+
+## Phase 5: SQL & DataFrame Migration
+
+Spark SQL behavior changes between versions can silently alter query results.
+
+### Key Breaking Changes (2.x → 3.x)
+
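+- `CAST` still wraps or returns `NULL` silently unless `spark.sql.ansi.enabled=true`, but table inserts now reject unsafe implicit casts (`spark.sql.storeAssignmentPolicy` defaults to `ANSI`)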
+- Dates and timestamps use the Proleptic Gregorian calendar, which can shift values earlier than 1582-10-15
+- Column resolution order changed in subqueries
+- `spark.sql.legacy.timeParserPolicy` controls date/time parsing behavior
+
+### Key Breaking Changes (3.x → 4.x)
+
+- ANSI mode is default (`spark.sql.ansi.enabled=true`)
+- Stricter type coercion in comparisons
+- Some `spark.sql.legacy.*` flags may be removed or have new defaults; verify each flag you set against the migration guide
+
+### Checklist
+
+- [ ] Audit SQL strings and DataFrame expressions for changed behavior
+- [ ] Add explicit `CAST` where implicit coercion relied on legacy behavior
+- [ ] Update date/time format patterns to match new parser
+- [ ] Test SQL queries with representative data and compare output to pre-upgrade baseline
+- [ ] Set `spark.sql.legacy.*` flags temporarily if needed for phased migration
+
+---
+
+## Phase 6: Test Validation
+
+### Checklist
+
+- [ ] All code compiles without errors
+- [ ] All existing unit tests pass
+- [ ] All existing integration tests pass
+- [ ] Run Spark jobs locally with sample data and compare output to pre-upgrade baseline
+- [ ] No deprecation warnings remain (any that do are documented with a migration timeline)
+- [ ] Update CI/CD pipeline to use new Spark version
+- [ ] Document any `spark.sql.legacy.*` flags that are set temporarily
+
+## Done When
+
+✓ Project compiles against target Spark version
+✓ All tests pass
+✓ No removed APIs remain in code
+✓ Configuration properties are current
+✓ SQL queries produce correct results
+✓ Upgrade impact documented in `spark_upgrade_impact.md`
\ No newline at end of file
diff --git a/skills/spark-version-upgrade/references/api-changes.md b/skills/spark-version-upgrade/references/api-changes.md
new file mode 100644
index 0000000..c4cfbe1
--- /dev/null
+++ b/skills/spark-version-upgrade/references/api-changes.md
@@ -0,0 +1,152 @@
+# Spark API Changes Reference
+
+## Spark 2.x → 3.x Removals
+
+### Core API
+
+| Removed | Replacement | Notes |
+|---------|-------------|-------|
+| `SparkContext.accumulator()` | `SparkContext.longAccumulator()` / `doubleAccumulator()` | Use `AccumulatorV2` for custom types |
+| `SparkContext.tachyonStore` | Removed entirely | Tachyon/Alluxio off-heap store dropped |
+| `RDD.mapPartitionsWithContext` | `RDD.mapPartitions` | Task context available via `TaskContext.get()` |
+| `RDD.toJavaRDD()` (implicit) | Explicit `JavaRDD.fromRDD(rdd)` | Implicit conversions tightened |
+
+### SQL API
+
+| Removed | Replacement | Notes |
+|---------|-------------|-------|
+| `SQLContext` | `SparkSession` | Use `spark.sql(...)` instead of `sqlContext.sql(...)` |
+| `HiveContext` | `SparkSession.builder().enableHiveSupport()` | |
+| `DataFrame` (type alias) | `Dataset[Row]` | `DataFrame` still works as alias but prefer `Dataset[Row]` |
+| `createExternalTable` | `createTable` | Method renamed |
+| `registerTempTable` | `createOrReplaceTempView` | |
+| `SQLContext.read` | `SparkSession.read` | |
+| `SQLContext.createDataFrame` | `SparkSession.createDataFrame` | |
+
+### Streaming
+
+| Removed | Replacement | Notes |
+|---------|-------------|-------|
+| `DStream` API (spark-streaming) | Structured Streaming (`spark-sql`) | DStream still works but is maintenance-only |
+| `StreamingContext.awaitTermination` | `SparkSession.streams.awaitAnyTermination` | For Structured Streaming |
+| `StreamingContext.remember` | Watermark-based state management | |
+
+### ML / MLlib
+
+| Removed | Replacement | Notes |
+|---------|-------------|-------|
+| `org.apache.spark.mllib` (RDD-based) | `org.apache.spark.ml` (DataFrame-based) | RDD-based MLlib is deprecated |
+| `LabeledPoint` from mllib | `ml.feature` transformers | Use DataFrame pipelines |
+| `mllib.classification.SVMWithSGD` | `ml.classification.LinearSVC` | |
+| `mllib.clustering.KMeans` | `ml.clustering.KMeans` | Same algorithm, DataFrame API |
+
+---
+
+## Spark 3.x → 4.x Removals
+
+### Core API
+
+| Removed | Replacement | Notes |
+|---------|-------------|-------|
+| `SparkContext.hadoopConfiguration` (mutable) | `SparkSession.sessionState.newHadoopConf()` | Per-session Hadoop config |
+| `JavaSparkContext.sc()` | `JavaSparkContext.toSparkContext()` | Method renamed |
+| Scala 2.12 support | Scala 2.13 only | All `_2.12` artifacts dropped |
+| Java 8/11 support | Java 17+ required | |
+
+### SQL API
+
+| Removed | Replacement | Notes |
+|---------|-------------|-------|
+| `spark.sql.legacy.*` flags | No replacement — ANSI behavior is permanent | Audit all legacy flags |
+| Non-ANSI `CAST` behavior | Explicit error handling or `TRY_CAST` | Overflows now throw errors |
+| `spark.sql.legacy.timeParserPolicy` | New parser is default | Joda → java.time |
+| Implicit type coercion in comparisons | Explicit `CAST` required | `string = int` no longer auto-casts |
+
+### PySpark
+
+| Removed | Replacement | Notes |
+|---------|-------------|-------|
+| `pyspark.sql.types.ArrayType.containsNull` default change | Explicitly set `containsNull=True` | Default changed |
+| `DataFrame.toJSON()` returns `Dataset[String]` | `.collect()` to materialize | Behavior aligned with Scala |
+| Python 3.8 support | Python 3.9+ required | |
+
+---
+
+## Common Migration Patterns
+
+### Pattern: SQLContext → SparkSession
+
+```scala
+// BEFORE
+val conf = new SparkConf().setAppName("MyApp")
+val sc = new SparkContext(conf)
+val sqlContext = new SQLContext(sc)
+val df = sqlContext.read.json("data.json")
+
+// AFTER
+val spark = SparkSession.builder()
+  .appName("MyApp")
+  .getOrCreate()
+val df = spark.read.json("data.json")
+```
+
+### Pattern: Accumulator v1 → v2
+
+```scala
+// BEFORE
+val counter = sc.accumulator(0, "my-counter")
+rdd.foreach(x => counter += 1)
+
+// AFTER
+val counter = sc.longAccumulator("my-counter")
+rdd.foreach(x => counter.add(1))
+```
+
+### Pattern: registerTempTable → createOrReplaceTempView
+
+```scala
+// BEFORE
+df.registerTempTable("my_table")
+
+// AFTER
+df.createOrReplaceTempView("my_table")
+```
+
+### Pattern: PySpark UDF Registration
+
+```python
+# BEFORE (Spark 2.x)
+from pyspark.sql.functions import udf
+from pyspark.sql.types import StringType
+my_udf = udf(lambda x: x.upper(), StringType())
+
+# AFTER (Spark 3.x+ preferred)
+from pyspark.sql.functions import udf
+from pyspark.sql.types import StringType
+
+@udf(returnType=StringType())
+def my_udf(x):
+    return x.upper()
+```
+
+### Pattern: ANSI Mode Error Handling (3.x → 4.x)
+
+```sql
+-- BEFORE (non-ANSI, returns NULL on overflow)
+SELECT CAST('999999999999' AS INT)
+
+-- AFTER (ANSI mode, throws error — use TRY_CAST for NULL behavior)
+SELECT TRY_CAST('999999999999' AS INT)
+```
+
+### Pattern: Date/Time Parsing (2.x → 3.x)
+
+```scala
+// BEFORE (lenient Joda-based parsing)
+spark.sql("SELECT to_date('2023-1-5', 'yyyy-MM-dd')")
+
+// AFTER (strict java.time parsing — single-digit month/day needs adjusted pattern)
+spark.sql("SELECT to_date('2023-1-5', 'yyyy-M-d')")
+// Or set legacy policy temporarily:
+// spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
+```
diff --git a/skills/spark-version-upgrade/references/config-changes.md b/skills/spark-version-upgrade/references/config-changes.md
new file mode 100644
index 0000000..27a1f00
--- /dev/null
+++ b/skills/spark-version-upgrade/references/config-changes.md
@@ -0,0 +1,78 @@
+# Spark Configuration Changes Reference
+
+## Spark 2.x → 3.x: Renamed Properties
+
+| Old Property | New Property |
+|-------------|-------------|
+| `spark.shuffle.file.buffer.kb` | `spark.shuffle.file.buffer` |
+| `spark.shuffle.consolidateFiles` | Removed (always consolidated) |
+| `spark.reducer.maxMbInFlight` | `spark.reducer.maxSizeInFlight` |
+| `spark.shuffle.memoryFraction` | Removed (unified memory management) |
+| `spark.storage.memoryFraction` | Removed (unified memory management) |
+| `spark.storage.unrollFraction` | Removed (unified memory management) |
+| `spark.yarn.am.port` | `spark.driver.port` |
+| `spark.tachyonStore.baseDir` | Removed |
+| `spark.tachyonStore.url` | Removed |
+| `spark.sql.tungsten.enabled` | Removed (always enabled) |
+| `spark.sql.codegen.wholeStage` | `spark.sql.codegen.wholeStage` (default changed to `true`) |
+| `spark.sql.parquet.int96AsTimestamp` | `spark.sql.parquet.int96AsTimestamp` (default changed to `true`) |
+| `spark.sql.hive.convertMetastoreParquet` | Still valid but default behavior changed |
+| `spark.akka.*` | Removed (Akka replaced with Netty RPC) |
+
+## Spark 2.x → 3.x: New Important Defaults
+
+| Property | Old Default | New Default | Impact |
+|----------|------------|-------------|--------|
+| `spark.sql.adaptive.enabled` | `false` | `true` (3.2+) | AQE auto-optimizes shuffles and joins |
+| `spark.sql.ansi.enabled` | `false` | `false` (3.x) | Opt-in for strict SQL behavior |
+| `spark.sql.sources.partitionOverwriteMode` | `static` | `static` | Consider `dynamic` for INSERT OVERWRITE |
+| `spark.sql.legacy.timeParserPolicy` | N/A | `EXCEPTION` | Strict date/time parsing |
+| `spark.sql.legacy.createHiveTableByDefault` | `true` | `false` | Tables default to data source format |
+
+## Spark 3.x → 4.x: Removed Properties
+
+| Removed Property | Migration Action |
+|-----------------|-----------------|
+| `spark.sql.legacy.timeParserPolicy` | Remove — new parser is permanent |
+| `spark.sql.legacy.allowNegativeScaleOfDecimal` | Remove — negative scale not allowed |
+| `spark.sql.legacy.createHiveTableByDefault` | Remove — data source tables default |
+| `spark.sql.legacy.replaceDatabricksSparkAvro.enabled` | Remove — native Avro is standard |
+| `spark.sql.legacy.setopsPrecedence.enabled` | Remove — SQL standard precedence permanent |
+| `spark.sql.legacy.exponentLiteralAsDecimal.enabled` | Remove — standard behavior permanent |
+| `spark.sql.legacy.allowHashOnMapType` | Remove |
+
+## Spark 3.x → 4.x: Changed Defaults
+
+| Property | Old Default (3.x) | New Default (4.x) | Impact |
+|----------|-------------------|-------------------|--------|
+| `spark.sql.ansi.enabled` | `false` | `true` | Overflow/cast errors throw instead of returning NULL |
+| `spark.sql.storeAssignmentPolicy` | `ANSI` | `STRICT` | Stricter type checking on INSERT |
+| `spark.sql.adaptive.coalescePartitions.enabled` | `true` | `true` | No change, but AQE behavior refined |
+| `spark.sql.sources.default` | `parquet` | `parquet` | No change |
+
+## How to Find Config Usage in Your Codebase
+
+```bash
+# Find all Spark config references
+grep -rn 'spark\.' --include='*.scala' --include='*.java' --include='*.py' \
+  --include='*.conf' --include='*.properties' --include='*.xml' --include='*.yaml'
+
+# Find legacy flags specifically
+grep -rn 'spark.sql.legacy' --include='*.scala' --include='*.java' --include='*.py' \
+  --include='*.conf' --include='*.properties'
+
+# Find spark-defaults.conf
+find . -name 'spark-defaults.conf' -o -name 'spark-env.sh'
+
+# Find spark-submit scripts with --conf flags
+grep -rn '\-\-conf spark\.' --include='*.sh' --include='*.bash' --include='Makefile'
+```
+
+## Migration Strategy for Legacy Flags
+
+When upgrading to Spark 4.x, `spark.sql.legacy.*` flags are removed. To migrate safely:
+
+1. **Audit**: List all `spark.sql.legacy.*` flags in your codebase
+2. **Test without them**: Remove each flag on Spark 3.x and run tests to surface failures
+3. **Fix code**: Update SQL/DataFrame code to work under non-legacy behavior
+4. **Then upgrade**: Bump to Spark 4.x after all legacy flags are eliminated

From d70d538dac0bd095dcc8ebca54a459288e4a85f9 Mon Sep 17 00:00:00 2001
From: Calvin Smithu <email@cjsmith.io>
Date: Tue, 17 Mar 2026 12:22:46 -0600
Subject: [PATCH 2/4] Remove local references, point to official Apache Spark
 migration guide

Replace references/api-changes.md and references/config-changes.md with
the canonical upstream URL:
https://spark.apache.org/docs/latest/migration-guide.html

Co-authored-by: openhands <openhands@all-hands.dev>
---
 skills/spark-version-upgrade/README.md        |   3 +-
 skills/spark-version-upgrade/SKILL.md         |   8 +-
 .../references/api-changes.md                 | 152 ------------------
 .../references/config-changes.md              |  78 ---------
 4 files changed, 5 insertions(+), 236 deletions(-)
 delete mode 100644 skills/spark-version-upgrade/references/api-changes.md
 delete mode 100644 skills/spark-version-upgrade/references/config-changes.md

diff --git a/skills/spark-version-upgrade/README.md b/skills/spark-version-upgrade/README.md
index 1e0d6bb..282bd15 100644
--- a/skills/spark-version-upgrade/README.md
+++ b/skills/spark-version-upgrade/README.md
@@ -39,8 +39,7 @@ This skill provides a structured, six-phase workflow for upgrading Apache Spark
 
 ## Reference Material
 
-- [references/api-changes.md](references/api-changes.md) — Catalog of removed/deprecated APIs with before/after code
-- [references/config-changes.md](references/config-changes.md) — Spark configuration property rename/removal mapping
+- [Apache Spark Migration Guide](https://spark.apache.org/docs/latest/migration-guide.html) — The official, up-to-date guide covering API removals, configuration changes, SQL behavior, PySpark, Structured Streaming, and MLlib for every Spark release
 
 ## Example Usage
 
diff --git a/skills/spark-version-upgrade/SKILL.md b/skills/spark-version-upgrade/SKILL.md
index fe1ca87..979902b 100644
--- a/skills/spark-version-upgrade/SKILL.md
+++ b/skills/spark-version-upgrade/SKILL.md
@@ -115,7 +115,8 @@ Replace removed and deprecated APIs. Work through compiler errors systematically
 
 ### Common Patterns
 
-See [references/api-changes.md](references/api-changes.md) for a full catalog.
+Consult the official Apache Spark migration guide for the complete list of changes for each version:
+https://spark.apache.org/docs/latest/migration-guide.html
 
 #### SparkSession Creation (2.x → 3.x)
 
@@ -168,9 +169,8 @@ val acc = sc.longAccumulator("name")
 
 ## Phase 4: Configuration Migration
 
-Spark renames and removes configuration properties between versions.
-
-See [references/config-changes.md](references/config-changes.md) for the full mapping.
+Spark renames and removes configuration properties between versions. The official migration guide documents every renamed and removed property per release:
+https://spark.apache.org/docs/latest/migration-guide.html
 
 ### Checklist
 
diff --git a/skills/spark-version-upgrade/references/api-changes.md b/skills/spark-version-upgrade/references/api-changes.md
deleted file mode 100644
index c4cfbe1..0000000
--- a/skills/spark-version-upgrade/references/api-changes.md
+++ /dev/null
@@ -1,152 +0,0 @@
-# Spark API Changes Reference
-
-## Spark 2.x → 3.x Removals
-
-### Core API
-
-| Removed | Replacement | Notes |
-|---------|-------------|-------|
-| `SparkContext.accumulator()` | `SparkContext.longAccumulator()` / `doubleAccumulator()` | Use `AccumulatorV2` for custom types |
-| `SparkContext.tachyonStore` | Removed entirely | Tachyon/Alluxio off-heap store dropped |
-| `RDD.mapPartitionsWithContext` | `RDD.mapPartitions` | Task context available via `TaskContext.get()` |
-| `RDD.toJavaRDD()` (implicit) | Explicit `JavaRDD.fromRDD(rdd)` | Implicit conversions tightened |
-
-### SQL API
-
-| Removed | Replacement | Notes |
-|---------|-------------|-------|
-| `SQLContext` | `SparkSession` | Use `spark.sql(...)` instead of `sqlContext.sql(...)` |
-| `HiveContext` | `SparkSession.builder().enableHiveSupport()` | |
-| `DataFrame` (type alias) | `Dataset[Row]` | `DataFrame` still works as alias but prefer `Dataset[Row]` |
-| `createExternalTable` | `createTable` | Method renamed |
-| `registerTempTable` | `createOrReplaceTempView` | |
-| `SQLContext.read` | `SparkSession.read` | |
-| `SQLContext.createDataFrame` | `SparkSession.createDataFrame` | |
-
-### Streaming
-
-| Removed | Replacement | Notes |
-|---------|-------------|-------|
-| `DStream` API (spark-streaming) | Structured Streaming (`spark-sql`) | DStream still works but is maintenance-only |
-| `StreamingContext.awaitTermination` | `SparkSession.streams.awaitAnyTermination` | For Structured Streaming |
-| `StreamingContext.remember` | Watermark-based state management | |
-
-### ML / MLlib
-
-| Removed | Replacement | Notes |
-|---------|-------------|-------|
-| `org.apache.spark.mllib` (RDD-based) | `org.apache.spark.ml` (DataFrame-based) | RDD-based MLlib is deprecated |
-| `LabeledPoint` from mllib | `ml.feature` transformers | Use DataFrame pipelines |
-| `mllib.classification.SVMWithSGD` | `ml.classification.LinearSVC` | |
-| `mllib.clustering.KMeans` | `ml.clustering.KMeans` | Same algorithm, DataFrame API |
-
----
-
-## Spark 3.x → 4.x Removals
-
-### Core API
-
-| Removed | Replacement | Notes |
-|---------|-------------|-------|
-| `SparkContext.hadoopConfiguration` (mutable) | `SparkSession.sessionState.newHadoopConf()` | Per-session Hadoop config |
-| `JavaSparkContext.sc()` | `JavaSparkContext.toSparkContext()` | Method renamed |
-| Scala 2.12 support | Scala 2.13 only | All `_2.12` artifacts dropped |
-| Java 8/11 support | Java 17+ required | |
-
-### SQL API
-
-| Removed | Replacement | Notes |
-|---------|-------------|-------|
-| `spark.sql.legacy.*` flags | No replacement — ANSI behavior is permanent | Audit all legacy flags |
-| Non-ANSI `CAST` behavior | Explicit error handling or `TRY_CAST` | Overflows now throw errors |
-| `spark.sql.legacy.timeParserPolicy` | New parser is default | Joda → java.time |
-| Implicit type coercion in comparisons | Explicit `CAST` required | `string = int` no longer auto-casts |
-
-### PySpark
-
-| Removed | Replacement | Notes |
-|---------|-------------|-------|
-| `pyspark.sql.types.ArrayType.containsNull` default change | Explicitly set `containsNull=True` | Default changed |
-| `DataFrame.toJSON()` returns `Dataset[String]` | `.collect()` to materialize | Behavior aligned with Scala |
-| Python 3.8 support | Python 3.9+ required | |
-
----
-
-## Common Migration Patterns
-
-### Pattern: SQLContext → SparkSession
-
-```scala
-// BEFORE
-val conf = new SparkConf().setAppName("MyApp")
-val sc = new SparkContext(conf)
-val sqlContext = new SQLContext(sc)
-val df = sqlContext.read.json("data.json")
-
-// AFTER
-val spark = SparkSession.builder()
-  .appName("MyApp")
-  .getOrCreate()
-val df = spark.read.json("data.json")
-```
-
-### Pattern: Accumulator v1 → v2
-
-```scala
-// BEFORE
-val counter = sc.accumulator(0, "my-counter")
-rdd.foreach(x => counter += 1)
-
-// AFTER
-val counter = sc.longAccumulator("my-counter")
-rdd.foreach(x => counter.add(1))
-```
-
-### Pattern: registerTempTable → createOrReplaceTempView
-
-```scala
-// BEFORE
-df.registerTempTable("my_table")
-
-// AFTER
-df.createOrReplaceTempView("my_table")
-```
-
-### Pattern: PySpark UDF Registration
-
-```python
-# BEFORE (Spark 2.x)
-from pyspark.sql.functions import udf
-from pyspark.sql.types import StringType
-my_udf = udf(lambda x: x.upper(), StringType())
-
-# AFTER (Spark 3.x+ preferred)
-from pyspark.sql.functions import udf
-from pyspark.sql.types import StringType
-
-@udf(returnType=StringType())
-def my_udf(x):
-    return x.upper()
-```
-
-### Pattern: ANSI Mode Error Handling (3.x → 4.x)
-
-```sql
--- BEFORE (non-ANSI, returns NULL on overflow)
-SELECT CAST('999999999999' AS INT)
-
--- AFTER (ANSI mode, throws error — use TRY_CAST for NULL behavior)
-SELECT TRY_CAST('999999999999' AS INT)
-```
-
-### Pattern: Date/Time Parsing (2.x → 3.x)
-
-```scala
-// BEFORE (lenient Joda-based parsing)
-spark.sql("SELECT to_date('2023-1-5', 'yyyy-MM-dd')")
-
-// AFTER (strict java.time parsing — single-digit month/day needs adjusted pattern)
-spark.sql("SELECT to_date('2023-1-5', 'yyyy-M-d')")
-// Or set legacy policy temporarily:
-// spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
-```
diff --git a/skills/spark-version-upgrade/references/config-changes.md b/skills/spark-version-upgrade/references/config-changes.md
deleted file mode 100644
index 27a1f00..0000000
--- a/skills/spark-version-upgrade/references/config-changes.md
+++ /dev/null
@@ -1,78 +0,0 @@
-# Spark Configuration Changes Reference
-
-## Spark 2.x → 3.x: Renamed Properties
-
-| Old Property | New Property |
-|-------------|-------------|
-| `spark.shuffle.file.buffer.kb` | `spark.shuffle.file.buffer` |
-| `spark.shuffle.consolidateFiles` | Removed (always consolidated) |
-| `spark.reducer.maxMbInFlight` | `spark.reducer.maxSizeInFlight` |
-| `spark.shuffle.memoryFraction` | Removed (unified memory management) |
-| `spark.storage.memoryFraction` | Removed (unified memory management) |
-| `spark.storage.unrollFraction` | Removed (unified memory management) |
-| `spark.yarn.am.port` | `spark.driver.port` |
-| `spark.tachyonStore.baseDir` | Removed |
-| `spark.tachyonStore.url` | Removed |
-| `spark.sql.tungsten.enabled` | Removed (always enabled) |
-| `spark.sql.codegen.wholeStage` | `spark.sql.codegen.wholeStage` (default changed to `true`) |
-| `spark.sql.parquet.int96AsTimestamp` | `spark.sql.parquet.int96AsTimestamp` (default changed to `true`) |
-| `spark.sql.hive.convertMetastoreParquet` | Still valid but default behavior changed |
-| `spark.akka.*` | Removed (Akka replaced with Netty RPC) |
-
-## Spark 2.x → 3.x: New Important Defaults
-
-| Property | Old Default | New Default | Impact |
-|----------|------------|-------------|--------|
-| `spark.sql.adaptive.enabled` | `false` | `true` (3.2+) | AQE auto-optimizes shuffles and joins |
-| `spark.sql.ansi.enabled` | `false` | `false` (3.x) | Opt-in for strict SQL behavior |
-| `spark.sql.sources.partitionOverwriteMode` | `static` | `static` | Consider `dynamic` for INSERT OVERWRITE |
-| `spark.sql.legacy.timeParserPolicy` | N/A | `EXCEPTION` | Strict date/time parsing |
-| `spark.sql.legacy.createHiveTableByDefault` | `true` | `false` | Tables default to data source format |
-
-## Spark 3.x → 4.x: Removed Properties
-
-| Removed Property | Migration Action |
-|-----------------|-----------------|
-| `spark.sql.legacy.timeParserPolicy` | Remove — new parser is permanent |
-| `spark.sql.legacy.allowNegativeScaleOfDecimal` | Remove — negative scale not allowed |
-| `spark.sql.legacy.createHiveTableByDefault` | Remove — data source tables default |
-| `spark.sql.legacy.replaceDatabricksSparkAvro.enabled` | Remove — native Avro is standard |
-| `spark.sql.legacy.setopsPrecedence.enabled` | Remove — SQL standard precedence permanent |
-| `spark.sql.legacy.exponentLiteralAsDecimal.enabled` | Remove — standard behavior permanent |
-| `spark.sql.legacy.allowHashOnMapType` | Remove |
-
-## Spark 3.x → 4.x: Changed Defaults
-
-| Property | Old Default (3.x) | New Default (4.x) | Impact |
-|----------|-------------------|-------------------|--------|
-| `spark.sql.ansi.enabled` | `false` | `true` | Overflow/cast errors throw instead of returning NULL |
-| `spark.sql.storeAssignmentPolicy` | `ANSI` | `STRICT` | Stricter type checking on INSERT |
-| `spark.sql.adaptive.coalescePartitions.enabled` | `true` | `true` | No change, but AQE behavior refined |
-| `spark.sql.sources.default` | `parquet` | `parquet` | No change |
-
-## How to Find Config Usage in Your Codebase
-
-```bash
-# Find all Spark config references
-grep -rn 'spark\.' --include='*.scala' --include='*.java' --include='*.py' \
-  --include='*.conf' --include='*.properties' --include='*.xml' --include='*.yaml'
-
-# Find legacy flags specifically
-grep -rn 'spark.sql.legacy' --include='*.scala' --include='*.java' --include='*.py' \
-  --include='*.conf' --include='*.properties'
-
-# Find spark-defaults.conf
-find . -name 'spark-defaults.conf' -o -name 'spark-env.sh'
-
-# Find spark-submit scripts with --conf flags
-grep -rn '\-\-conf spark\.' --include='*.sh' --include='*.bash' --include='Makefile'
-```
-
-## Migration Strategy for Legacy Flags
-
-When upgrading to Spark 4.x, `spark.sql.legacy.*` flags are removed. To migrate safely:
-
-1. **Audit**: List all `spark.sql.legacy.*` flags in your codebase
-2. **Test without them**: Remove each flag on Spark 3.x and run tests to surface failures
-3. **Fix code**: Update SQL/DataFrame code to work under non-legacy behavior
-4. **Then upgrade**: Bump to Spark 4.x after all legacy flags are eliminated

From 8f2e4ffa9b2eb6304c0436a893b0d00094014f42 Mon Sep 17 00:00:00 2001
From: Calvin Smithu <email@cjsmith.io>
Date: Tue, 17 Mar 2026 12:23:54 -0600
Subject: [PATCH 3/4] Move Apache migration guide reference to Phase 1
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The guide is most relevant during inventory — read it before making changes.

Co-authored-by: openhands <openhands@all-hands.dev>
---
 skills/spark-version-upgrade/SKILL.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/skills/spark-version-upgrade/SKILL.md b/skills/spark-version-upgrade/SKILL.md
index 979902b..41cfd47 100644
--- a/skills/spark-version-upgrade/SKILL.md
+++ b/skills/spark-version-upgrade/SKILL.md
@@ -34,10 +34,12 @@ Upgrade Apache Spark applications between major versions with a structured, phas
 
 ## Phase 1: Inventory & Impact Analysis
 
-Before changing any code, assess what needs to change.
+Before changing any code, assess what needs to change. Start with the official Apache Spark migration guide for the target version, which lists API removals, config renames, and behavioral changes release by release:
+https://spark.apache.org/docs/latest/migration-guide.html
 
 ### Checklist
 
+- [ ] Read the migration guide section for the target Spark version
 - [ ] Identify current Spark version (check `pom.xml`, `build.sbt`, `build.gradle`, or `requirements.txt`)
 - [ ] Identify target Spark version
 - [ ] Search for deprecated APIs: `grep -rn 'import org.apache.spark' --include='*.scala' --include='*.java' --include='*.py'`

From 3cc55a8bff3a28ce7877e837fd2f1acf2736cfbc Mon Sep 17 00:00:00 2001
From: Calvin Smithu <email@cjsmith.io>
Date: Tue, 17 Mar 2026 12:32:26 -0600
Subject: [PATCH 4/4] Update marketplace guidance

---
 AGENTS.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/AGENTS.md b/AGENTS.md
index e7636b9..cba9865 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -92,8 +92,7 @@ When editing or adding skills in this repo, follow these rules (and add new skil
 
 ## CI / validation gotchas
 
-- The test suite expects **every directory under `skills/`** to be listed in `marketplaces/default.json`.
-  - If you add a new skill (or rebase onto a main branch that added skills), update the marketplace file or CI will fail with `Skills missing from marketplace: [...]`.
+- The test suite expects **every directory under `skills/`** to be listed in one of the marketplace files (for example `marketplaces/default.json` or `marketplaces/large-codebase.json`). If you add a new skill (or rebase onto a main branch that added skills), update the appropriate marketplace file or CI will fail with `Skills missing from marketplace: [...]`.
 
 ## PR review plugin notes